Commit 1aa557cd authored by Antti Hyttinen's avatar Antti Hyttinen

Looking at the simulation setting.

parent 2adce82a
In this section we present our results from experiments with synthetic and realistic data sets.
\subsection{Synthetic data}
\rcomment{I presume MM's preference was that the outcome would be drawn from a Bernoulli distribution and that the decisions would be independent. Let us first explain those choices thoroughly and then mention what we changed, as discussed.}
\begin{itemize}
\item Data generation \\
We experimented with synthetic data sets to examine the accuracy, unbiasedness, and robustness of our method to violations of its assumptions.
We sampled $N=50\,000$ observations of $X$, $Z$, and $W$ as independent standard Gaussians. We then drew the outcome $Y$ from a Bernoulli distribution with parameter $p = 1 - \invlogit(\beta_x x+\beta_z z+\beta_w w)$, so that $\prob{Y=0 \mid X, Z, W} = \invlogit(\beta_x x+\beta_z z+\beta_w w)$, where the coefficients for $X$, $Z$ and $W$ were set to $1$, $1$ and $0.2$, respectively. This process follows the suggestion of Lakkaraju et al.~\cite{lakkaraju2017selective}.
We used a number of different decision mechanisms in our simulations. Decisions were assigned by computing the quantile each subject belongs to; the quantile was obtained as the inverse cdf of ... . This way the observations remained independent while the leniency was still a good estimate of the acceptance rate (the acceptance rate converges stochastically to the leniency). This constitutes a decider module. We experimented with different combinations of decider and data-generating modules to show that our method is robust against non-informative, biased and otherwise poor decisions. Due to space constraints we defer these results to ...
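As a concrete illustration, the data-generating and decider modules described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact experimental code: the random seed, the decider's use of $X + Z$ as its risk score, and the Gaussian cdf used to turn that score into a quantile are our assumptions, since the exact inverse cdf is elided in the text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)          # seed is arbitrary
N = 50_000
beta_x, beta_z, beta_w = 1.0, 1.0, 0.2  # coefficients for X, Z, W

# Data-generating module: three independent standard Gaussians.
X = rng.standard_normal(N)
Z = rng.standard_normal(N)
W = rng.standard_normal(N)

def invlogit(a):
    return 1.0 / (1.0 + np.exp(-a))

# P(Y = 0 | X, Z, W) = invlogit(b_x X + b_z Z + b_w W),
# hence Y ~ Bernoulli(1 - invlogit(...)).
p_y0 = invlogit(beta_x * X + beta_z * Z + beta_w * W)
Y = rng.binomial(1, 1.0 - p_y0)

# Decider module (assumed form): each subject's quantile is the cdf of
# its risk score; a subject is accepted (T = 1) when the quantile falls
# below the leniency r.  Decisions are then independent across subjects,
# and the acceptance rate converges stochastically to r.
r = 0.5                                  # leniency
score = X + Z                            # assumed decider risk score
quantile = norm.cdf(score / np.sqrt(2))  # score ~ N(0, 2)
T = (quantile < r).astype(int)
```

With $N = 50\,000$ the realized acceptance rate `T.mean()` lies within a few tenths of a percentage point of the leniency $r$, while each decision remains an independent draw.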
\item Algorithms \\
We deployed multiple evaluator modules to estimate the true failure rate of the decider module. These estimates should be close to those of the true-evaluation evaluator module, and all of them are eventually compared to the human evaluation curve.
\begin{itemize}
\item True evaluation
......
\item Estimates of the counterfactuals $Y(1)$ for the unobserved values of $Y$ were obtained as posterior expectations computed with Stan. We used the NUTS sampler to estimate the posterior. When the values for...
\end{itemize}
\end{itemize}
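On synthetic data every outcome $Y$ is observed, so the true evaluation can be computed directly. The following is a hedged sketch of such an evaluator module, under our reading of the setup that the failure rate at leniency $r$ is the number of failures among the accepted subjects divided by the total number of subjects (so that rejected subjects cannot fail); the function name and argument names are ours.

```python
import numpy as np

def true_evaluation(risk_score, y, acceptance_rate):
    """True failure rate when the lowest-risk subjects are accepted.

    risk_score      -- predicted probability of a bad outcome (Y = 0)
    y               -- fully observed synthetic outcomes (0 = failure)
    acceptance_rate -- leniency r in (0, 1]
    """
    n = len(y)
    n_accept = int(round(acceptance_rate * n))
    accepted = np.argsort(risk_score)[:n_accept]   # lowest risk first
    failures = int((y[accepted] == 0).sum())
    return failures / n        # rejected subjects cannot fail
```

Evaluator modules that only see the selectively labelled data (e.g.\ contraction and the CBI approach) are then judged by how closely their failure-rate curves match this one.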
\item Results
\rcomment{Target for this section from the problem formulation: show that our evaluator is unbiased and accurate (report the mean absolute error), and robust to changes in data generation (perhaps a table; at least discuss situations where the decisions are bad, biased or random, i.e.\ non-informative or misleading). Also: what happens if the decider in the modelling step is bad and its information is used as input?}
\begin{itemize}
\item Accuracy: we have defined two metrics, acceptance rate and failure rate. In this section we show that our method accurately recovers the true failure rate at all acceptance rates with a low mean absolute error. As Figure X shows, our method recovers the true performance of the predictive model with good accuracy. The mean absolute errors w.r.t.\ the true evaluation were 0.XXX and 0.XXX for the contraction and the CBI approach, respectively.
\item Figure X also shows that our method tracks the true evaluation curve with low variance.
\end{itemize}
\end{itemize}
\subsection{Realistic data}
In this section we present results from experiments with (realistic) data sets.
......