\section{Framework}
$$\text{FEATURES} + \text{DECISION} = \text{DATA} \rightarrow \text{MODEL} \rightarrow \text{EVALUATION}$$
In our framework, the flow of information is divided into three steps: the data, modelling and evaluation steps. Our framework definition relies on precise definitions of the properties of each step. A step is fully defined when its input and output are unambiguously described.
The data step generates all the data. The data include the features and the outcomes of the subjects, as well as the features of the deciders and their decisions concerning the subjects. The features of the subjects comprise observable and unobservable features, denoted by X and Z respectively. The only feature of a decider (judge, doctor) is their leniency R. The decisions given are denoted by T and the resulting outcomes by Y, with 0 for a negative and 1 for a positive outcome. The central variants for feature and decision generation are presented in section \ref{sec:modules}, and the dependencies between the variables are shown as a directed graph in figure \ref{fig:initial_model}.
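To make the data step concrete, the sketch below generates a toy data set with the variables above. It is only illustrative: the functional forms (a logistic risk, threshold-style deciders) and all parameter values are assumptions of the sketch, not the exact mechanisms used in our experiments.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate_data(n_subjects=5000, n_judges=50):
    """Toy data step: observable X, unobservable Z, judge leniency R,
    decisions T and outcomes Y (0 = negative, 1 = positive)."""
    x = rng.normal(size=n_subjects)              # observable features X
    z = rng.normal(size=n_subjects)              # unobservable features Z
    judge = rng.integers(n_judges, size=n_subjects)
    r = rng.uniform(0.1, 0.9, size=n_judges)     # leniency R of each judge

    # Each judge releases (T = 1) the fraction r of their least risky cases.
    p_neg = sigmoid(x + z)                       # probability of Y = 0
    t = np.zeros(n_subjects, dtype=int)
    for j in range(n_judges):
        idx = np.where(judge == j)[0]
        n_release = int(round(r[j] * len(idx)))
        t[idx[np.argsort(p_neg[idx])[:n_release]]] = 1

    # Outcome Y; under selective labeling it is observed only when T = 1.
    y = (rng.uniform(size=n_subjects) > p_neg).astype(int)
    y_obs = np.where(t == 1, y, -1)              # -1 marks a missing label
    return x, z, judge, r, t, y, y_obs
\end{verbatim}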
Once the data are available, they are given to a predictive model in the modelling step. The model, e.g. a regression model, is trained using one portion of the original data. Using the trained model and the observed covariates, we assign predictions of a negative outcome to the observations in the other part of the data. It is the performance of this model that we are interested in.
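As an illustration, the sketch below stands in for the modelling step. We do not prescribe a specific implementation in the text; here a scikit-learn logistic regression is assumed. The model sees only the observable features X and is trained only on the labeled cases of the training split, i.e. those with a positive decision (T = 1).
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_predictive_model(x_train, t_train, y_train):
    """Fit the predictive model on the labeled part of the training split,
    i.e. the cases that were released (T = 1) and hence have an outcome."""
    labeled = t_train == 1
    model = LogisticRegression()
    model.fit(x_train[labeled].reshape(-1, 1), y_train[labeled])
    return model

def predict_negative(model, x_test):
    """Predicted probability of a negative outcome (Y = 0) on the test split."""
    col = list(model.classes_).index(0)
    return model.predict_proba(x_test.reshape(-1, 1))[:, col]
\end{verbatim}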
Finally, the data with the predictions are given to an evaluation algorithm, which tries to output a reliable estimate of the model's failure rate. A good evaluation algorithm gives a precise, unbiased and low-variance estimate of the failure rate of the model at a given leniency $r$. The setting is characterized by the \emph{selective labeling} of the data: the outcome Y is observed only for the subjects who received a positive decision.
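With fully observed outcomes, the quantity the evaluator targets can be computed directly. The sketch below is our reading of that ``true evaluation'' under the conventions above, assuming that a failure is a released case with Y = 0 and that the evaluated model releases the $r$ fraction of least risky cases.
\begin{verbatim}
import numpy as np

def true_failure_rate(p_negative, y_true, r):
    """Failure rate of the model at leniency r: release the r fraction of
    test cases with the lowest predicted probability of a negative outcome
    and count how many of the released cases actually have Y = 0."""
    n = len(y_true)
    released = np.argsort(p_negative)[:int(round(r * n))]
    return np.sum(y_true[released] == 0) / n
\end{verbatim}
Under selective labeling the true outcomes are missing exactly for the cases the original decider rejected, which is what the evaluation algorithms have to work around.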
\section{Potential outcomes in model evaluation}
The potential outcomes evaluator uses Stan to infer the latent variable and the path coefficients and to estimate the expectation of the potential outcome Y(1) for the missing outcomes. The full hierarchical model is presented in eq. \ref{eq:po} below. Expectations for the missing outcomes are obtained as means of the posterior predictive distribution rather than as point estimates from the maximum of the joint posterior: the joint posterior appeared to be bimodal or skewed, and therefore the posterior mean gave better estimates.
Priors for the $\beta$ coefficients were chosen to be sufficiently non-informative without restricting the density estimation. Coefficients for the latent Z were restricted to be positive for posterior estimation; the resulting distribution is the half-normal distribution ($X\sim N(0, 1) \Rightarrow Y=|X| \sim$ Half-Normal). The $\alpha$ intercepts are included only for the decision variable, to emulate the differences in leniency across the $M$ judges. Thus subjects with equal x and z can have different probabilities of bail depending on the judge's leniency.
\begin{align} \label{eq:po}
	& \;\vdots \nonumber \\
	p(\alpha_j) & \propto 1 \hskip1.0em \text{for } j \in \{1, 2, \ldots, M\}
\end{align}
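For intuition, the following sketch mirrors the structure of the hierarchy in eq. \ref{eq:po} as a generative program. The logistic links and the prior scales written here are illustrative assumptions, and the improper flat prior on $\alpha_j$ is replaced by a very wide normal so that the sketch can actually be sampled.
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_one_configuration(rng, n_subjects, n_judges):
    """Draw latent variables and coefficients mirroring the hierarchy."""
    z = rng.normal(size=n_subjects)                   # latent unobservable Z
    beta_xt, beta_xy = rng.normal(0.0, 10.0, size=2)  # weakly informative
    beta_zt, beta_zy = np.abs(rng.normal(size=2))     # half-normal, positive
    alpha = rng.normal(0.0, 100.0, size=n_judges)     # stand-in for flat prior
    return z, alpha, beta_xt, beta_zt, beta_xy, beta_zy

def bernoulli_probabilities(x, judge, z, alpha,
                            beta_xt, beta_zt, beta_xy, beta_zy):
    """Success probabilities of the two Bernoulli likelihoods:
    the decision T (with judge-specific intercept) and the outcome Y(1)."""
    p_t = sigmoid(alpha[judge] + beta_xt * x + beta_zt * z)
    p_y1 = sigmoid(beta_xy * x + beta_zy * z)
    return p_t, p_y1
\end{verbatim}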
The model is fitted on the test set, and the expectations of the potential outcomes $Y(1)$ are used in place of the missing outcomes.
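In code, this plug-in step can be read as follows (a sketch reusing \texttt{true\_failure\_rate} from above; \texttt{e\_y1} denotes the posterior predictive mean of $Y(1)$ per test case and is a name we introduce only for this sketch):
\begin{verbatim}
import numpy as np

def po_failure_rate(p_negative, t, y_obs, e_y1, r):
    """Potential outcomes evaluation of the failure rate at leniency r.
    Observed outcomes (T = 1) are used as such; for unobserved outcomes the
    posterior predictive mean E[Y(1)] is plugged in, so a released case
    contributes 1 - E[Y(1)] to the expected number of failures
    (failure means Y = 0)."""
    n = len(t)
    released = np.argsort(p_negative)[:int(round(r * n))]
    failures = np.where(t[released] == 1,
                        (y_obs[released] == 0).astype(float),
                        1.0 - e_y1[released])
    return failures.sum() / n
\end{verbatim}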
\section{Results}
Here we present the results obtained from four core settings. Our core settings include situations where the data are generated with the unobservables Z. The settings are characterized by the outcome-generating mechanism and the decision assignment.
The four main result figures are presented in figure set \ref{fig:results_bayes} with captions indicating the outcome and decision mechanisms. Figures showing the variance of the failure rate estimates are in appendix section \ref{sec:diagnostic}.
\label{fig:results_bayes}
\end{figure}
The robustness checks (TBA) are presented in figure set \ref{fig:results_robustness}, depicting deciders who
\begin{enumerate}[label=\Alph*)]
	\item assign bail randomly with probability $r$,
	\item favour or dislike defendants with certain values of X, and
	\item are ``bad judges'' whose decisions depend only weakly on X ($\beta_x = 0.2$).
\end{enumerate}
The last figure (D) illustrates the situation where the data generation mechanism is exactly the same as the model specified in eq. \ref{eq:po}, with the $\beta$ coefficients equal to 1 and the $\alpha$ coefficients equal to 0.
%The figures show that the contraction method is fairly robust to changes in data and decision generation although it has some variance. The potential outcomes approach consistently underestimates the true failure rate resulting in mediocre performance (compared to the imputation algorithms presented in the SL paper). The mean absolute errors of contraction with regard to true evaluation were in the order of 0.009...0.005 compared to our MAEs of approximately 0.015...0.035. Worrying aspect of the analysis is that even if the data generating process follows exactly the specified hierarchical model, the failure rate is still approximately 0.015.
%Notable is that the diagnostic figures in section \ref{sec:diagnostic} show that few failure rate estimates are too high showing as "flyers/outliers" implying problems in model identifiability, probably as in bimodal posterior. (In the output when Stan tries to find the minimum of the negative log posterior (max of posterior) it occasionally converges to log probability of approximately -200...-400 when most of the times the minimum is at -7000...-9000.)
The figures show that the contraction method is fairly robust to changes in data and decision generation, although its estimates have some variance. The figures also show that the potential outcomes approach estimates the true failure rate slightly better than the contraction algorithm (MAEs of 0.00164 and 0.00165 compared to approximately 0.003...0.005, respectively). The results are in for only three of the core situations discussed (figs. 2a-c), but will be provided for the Thursday meeting. The diagnostic figures in section \ref{sec:diagnostic} also show that the predictions given by the potential outcomes approach have a lower variance.
Imposing constraints on the model in equation \ref{eq:po} might yield better results. The model could be constrained so that there is only one intercept $\alpha$ and so that $\beta_{xt}=\beta_{xy}$ and $\beta_{zt}=\beta_{zy}$. A caveat of the current approach is its scalability to large and multi-dimensional data sets.
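For completeness, the contraction baseline we compare against can be sketched as below. This is our reading of the algorithm in the SL paper, written with this paper's conventions (failure means $Y = 0$); the variable names and the handling of target rates above the lenient judge's own acceptance rate are ours.
\begin{verbatim}
import numpy as np

def contraction(t_q, y_q, p_negative_q, r):
    """Contraction estimate of the failure rate at acceptance rate r, using
    only the cases of the most lenient judge q (arrays subscripted _q)."""
    n = len(t_q)
    released = np.where(t_q == 1)[0]
    # Rank the judge's released cases from most to least risky under the model.
    ranked = released[np.argsort(-p_negative_q[released])]
    # Keep the least risky r * n cases released, i.e. remove the riskiest
    # cases until the target acceptance rate is reached.
    n_keep = min(int(round(r * n)), len(ranked))
    kept = ranked[len(ranked) - n_keep:]
    return np.sum(y_q[kept] == 0) / n
\end{verbatim}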
%\begin{figure}[H]
% \centering
% \begin{subfigure}[b]{0.475\textwidth}
% \includegraphics[width=\textwidth]{sl_result_random}
% \caption{Random decisions.}
% %\label{fig:}
% \end{subfigure}
% \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
% %(or a blank line to force the subfigure onto a new line)
% \begin{subfigure}[b]{0.475\textwidth}
% \includegraphics[width=\textwidth]{sl_result_biased}
% \caption{Biased decisions.}
% \label{fig:bias}
% \end{subfigure}
% \begin{subfigure}[b]{0.475\textwidth}
% \includegraphics[width=\textwidth]{sl_result_bad}
% \caption{Bad judge with $\beta_x=0.2$.}
% %\label{fig:}
% \end{subfigure}
% \quad %add desired spacing between images, e. g. ~, \quad, \qquad, \hfill etc.
% %(or a blank line to force the subfigure onto a new line)
% \begin{subfigure}[b]{0.475\textwidth}
% \includegraphics[width=\textwidth]{sl_result_bernoulli_bernoulli}
% \caption{Data generation as corresponding to model.}
% %\label{fig:}
% \end{subfigure}
% \caption{Robustness check figures: Failure rate vs. acceptance rate with varying levels of leniency. Different combinations of deciders and data generation mechanisms. Only one data set used, which affects the performance of contraction.}
% \label{fig:results_robustness}
%\end{figure}