Skip to content
Snippets Groups Projects
Analysis_07MAY2019_new.ipynb 173 KiB
Newer Older
Riku-Laine's avatar
Riku-Laine committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
Riku-Laine's avatar
Riku-Laine committed
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Data-sets\" data-toc-modified-id=\"Data-sets-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Data sets</a></span><ul class=\"toc-item\"><li><span><a href=\"#Synthetic-data-with-unobservables\" data-toc-modified-id=\"Synthetic-data-with-unobservables-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Synthetic data with unobservables</a></span></li><li><span><a href=\"#Data-without-unobservables\" data-toc-modified-id=\"Data-without-unobservables-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Data without unobservables</a></span></li></ul></li><li><span><a href=\"#Algorithms\" data-toc-modified-id=\"Algorithms-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Algorithms</a></span><ul class=\"toc-item\"><li><span><a href=\"#Contraction-algorithm\" data-toc-modified-id=\"Contraction-algorithm-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Contraction algorithm</a></span></li><li><span><a href=\"#Causal-approach---metrics\" data-toc-modified-id=\"Causal-approach---metrics-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>Causal approach - metrics</a></span></li></ul></li><li><span><a href=\"#Performance-comparison\" data-toc-modified-id=\"Performance-comparison-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Performance comparison</a></span><ul class=\"toc-item\"><li><span><a href=\"#With-unobservables-in-the-data\" data-toc-modified-id=\"With-unobservables-in-the-data-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>With unobservables in the data</a></span></li><li><span><a href=\"#Without-unobservables\" data-toc-modified-id=\"Without-unobservables-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>Without unobservables</a></span></li></ul></li></ul></div>"
Riku-Laine's avatar
Riku-Laine committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Riku-Laine's avatar
Riku-Laine committed
    "<!-- ##  Causal model\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "Our model is defined by the probabilistic expression \n",
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "\\begin{equation} \\label{model_disc}\n",
Riku-Laine's avatar
Riku-Laine committed
    "P(Y=0 | \\text{do}(R=r)) = \\sum_x \\underbrace{P(Y=0|X=x, T=1)}_\\text{1} \n",
    "\\overbrace{P(T=1|R=r, X=x)}^\\text{2} \n",
    "\\underbrace{P(X=x)}_\\text{3}\n",
    "\\end{equation}\n",
    "\n",
    "which is equal to \n",
    "\n",
    "\\begin{equation}\\label{model_cont}\n",
    "P(Y=0 | \\text{do}(R=r)) = \\int_x P(Y=0|X=x, T=1)P(T=1|R=r, X=x)P(X=x)\n",
    "\\end{equation}\n",
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "for continuous $x$. In the model Z is a latent, unobserved variable, and can be excluded from the expression with do-calculus by showing that $X$ is admissible for adjustment. Model as a graph:\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "![Model as picture](../figures/intervention_model.png \"Intervention model\")\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "For predicting the probability of negative outcome the following should hold because by Pearl $P(Y=0 | \\text{do}(R=r), X=x) = P(Y=0 | R=r, X=x)$ when $X$ is an admissible set:\n",
    "\n",
    "\\begin{equation} \\label{model_pred}\n",
    "P(Y=0 | \\text{do}(R=r), X=x) = P(Y=0|X=x, T=1)P(T=1|R=r, X=x).\n",
    "\\end{equation}\n",
    "\n",
    "Still it should be noted that this prediction takes into account the probability of the individual to be given a positive decision ($T=1$), see second term in \\ref{model_pred}.\n",
    "\n",
    "----\n",
    "\n",
    "### Notes\n",
    "\n",
    "* Equations \\ref{model_disc} and \\ref{model_cont} describe the whole causal effect in the population (the causal effect of changing $r$ over all strata $X$).\n",
    "* Prediction should be possible with \\ref{model_pred}. Both terms can be learned from the data. NB: the probability $P(Y=0 | \\text{do}(R=r), X=x)$ is lowest when the individual $x$ is the most dangerous or the least dangerous. How could we infer/predict the counterfactual \"what is the probability of $Y=0$ if we were to let this individual go?\" has yet to be calculated.\n",
Riku-Laine's avatar
Riku-Laine committed
    "* Is the effect of R learned/estimated correctly if it is just plugged in to a predictive model (e.g. logistic regression)? **NO**\n",
Riku-Laine's avatar
Riku-Laine committed
    "* $P(Y=0 | do(R=0)) = 0$ only in this application. <!-- My predictive models say that when $r=0$ the probability $P(Y=0) \\approx 0.027$ which would be a natural estimate in another application/scenario (e.g. in medicine the probability of an adverse event when a stronger medicine is distributed to everyone. Then the probability will be close to zero but not exactly zero.) -->\n",
    "\n",
    "Imports and settings."
Riku-Laine's avatar
Riku-Laine committed
   ]
  },
  {
   "cell_type": "code",
Riku-Laine's avatar
Riku-Laine committed
   "execution_count": 1,
Riku-Laine's avatar
Riku-Laine committed
   "metadata": {},
Riku-Laine's avatar
Riku-Laine committed
   "outputs": [],
Riku-Laine's avatar
Riku-Laine committed
   "source": [
    "# Imports\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from datetime import datetime\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as scs\n",
    "import scipy.integrate as si\n",
    "import seaborn as sns\n",
    "import numpy.random as npr\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "# Settings\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "plt.rcParams.update({'font.size': 20})\n",
Riku-Laine's avatar
Riku-Laine committed
    "plt.rcParams.update({'figure.figsize': (14, 7)})\n",
    "\n",
    "# Suppress deprecation warnings.\n",
    "\n",
    "import warnings\n",
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "def fxn():\n",
    "    warnings.warn(\"deprecated\", DeprecationWarning)\n",
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter(\"ignore\")\n",
    "    fxn()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Riku-Laine's avatar
Riku-Laine committed
    "## Data sets\n",
    "\n",
    "### Synthetic data with unobservables\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "In the chunk below, we generate the synthetic data as described by Lakkaraju et al. The default values and definitions of $Y$ and $T$ values follow their description.\n",
    "\n",
    "**Parameters**\n",
    "\n",
    "* M = `nJudges_M`, number of judges\n",
    "* N = `nSubjects_N`, number of subjects assigned to each judge\n",
    "* betas $\\beta_i$ = `beta_i`, where $i \\in \\{X, Z, W\\}$ are coefficients for the respected variables\n",
    "\n",
    "**Columns of the data:**\n",
    "\n",
    "* `judgeID_J` = judge IDs as running numbering from 0 to `nJudges_M - 1`\n",
    "* R = `acceptanceRate_R`, acceptance rates\n",
    "* X = `X`, invidual's features observable to all (models and judges)\n",
    "* Z = `Z`, information observable for judges only\n",
    "* W = `W`, unobservable / inaccessible information\n",
    "* T = `decision_T`, bail-or-jail decisions where $T=0$ represents jail decision and $T=1$ bail decision.\n",
Riku-Laine's avatar
Riku-Laine committed
    "* Y = `result_Y`, result variable, if $Y=0$ person will or would recidivate and if $Y=1$ person will or would not commit a crime.\n",
    "\n",
    "The generated data will have M\\*N rows."
Riku-Laine's avatar
Riku-Laine committed
   ]
  },
  {
   "cell_type": "code",
Riku-Laine's avatar
Riku-Laine committed
   "execution_count": 2,
Riku-Laine's avatar
Riku-Laine committed
   "metadata": {},
Riku-Laine's avatar
Riku-Laine committed
   "outputs": [],
Riku-Laine's avatar
Riku-Laine committed
   "source": [
Riku-Laine's avatar
Riku-Laine committed
    "def sigmoid(x):\n",
Riku-Laine's avatar
Riku-Laine committed
    "    '''Return value of sigmoid function (inverse of logit) at x.'''\n",
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    return 1 / (1 + np.exp(-1*x))\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "def dataWithUnobservables(nJudges_M=100,\n",
    "                          nSubjects_N=500,\n",
    "                          beta_X=1.0,\n",
    "                          beta_Z=1.0,\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    df = pd.DataFrame()\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    # Assign judge IDs as running numbering from 0 to nJudges_M - 1\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df = df.assign(judgeID_J=np.repeat(range(0, nJudges_M), nSubjects_N))\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    # Sample acceptance rates uniformly from a closed interval\n",
    "    # from 0.1 to 0.9 and round to tenth decimal place.\n",
    "    acceptance_rates = np.round(npr.uniform(.1, .9, nJudges_M), 10)\n",
    "\n",
    "    # Replicate the rates so they can be attached to the corresponding judge ID.\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df = df.assign(acceptanceRate_R=np.repeat(acceptance_rates, nSubjects_N))\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    # Sample the variables from standard Gaussian distributions.\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df = df.assign(X=npr.normal(size=nJudges_M * nSubjects_N))\n",
    "    df = df.assign(Z=npr.normal(size=nJudges_M * nSubjects_N))\n",
    "    df = df.assign(W=npr.normal(size=nJudges_M * nSubjects_N))\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    # Calculate P(Y=0|X, Z, W)\n",
Riku-Laine's avatar
Riku-Laine committed
    "    probabilities_Y = sigmoid(beta_X * df.X + beta_Z * df.Z + beta_W * df.W)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df = df.assign(probabilities_Y=probabilities_Y)\n",
    "\n",
    "    # Result is 0 if P(Y = 0| X = x; Z = z; W = w) >= 0.5 , 1 otherwise\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df = df.assign(result_Y=np.where(df.probabilities_Y >= 0.5, 0, 1))\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    # For the conditional probabilities of T we add noise ~ N(0, 0.1)\n",
Riku-Laine's avatar
Riku-Laine committed
    "    probabilities_T = sigmoid(beta_X * df.X + beta_Z * df.Z)\n",
    "    probabilities_T += np.sqrt(0.1) * npr.normal(size=nJudges_M * nSubjects_N)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df = df.assign(probabilities_T=probabilities_T)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    # Sort by judges then probabilities in decreasing order\n",
Riku-Laine's avatar
Riku-Laine committed
    "    # Most dangerous for each judge are at the top.\n",
    "    df.sort_values(by=[\"judgeID_J\", \"probabilities_T\"],\n",
    "                   ascending=False,\n",
    "                   inplace=True)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    # Iterate over the data. Subject will be given a negative decision\n",
    "    # if they are in the top (1-r)*100% of the individuals the judge will judge.\n",
    "    # I.e. if their within-judge-index is under 1 - acceptance threshold times\n",
    "    # the number of subjects assigned to each judge they will receive a\n",
    "    # negative decision.\n",
Riku-Laine's avatar
Riku-Laine committed
    "    df.reset_index(drop=True, inplace=True)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    df['decision_T'] = np.where((df.index.values % nSubjects_N) <\n",
    "                                ((1 - df['acceptanceRate_R']) * nSubjects_N),\n",
    "                                0, 1)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    # Halve the data set to test and train\n",
Riku-Laine's avatar
Riku-Laine committed
    "    train, test = train_test_split(df, test_size=0.5)\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    train_labeled = train.copy()\n",
    "    test_labeled = test.copy()\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
    "    # Set results as NA if decision is negative.\n",
    "    train_labeled.loc[train_labeled.decision_T == 0, 'result_Y'] = np.nan\n",
Riku-Laine's avatar
Riku-Laine committed
    "    test_labeled.loc[test_labeled.decision_T == 0, 'result_Y'] = np.nan\n",
Riku-Laine's avatar
Riku-Laine committed
    "\n",
Riku-Laine's avatar
Riku-Laine committed
    "    return train_labeled, train, test_labeled, test, df"
Riku-Laine's avatar
Riku-Laine committed
 
Loading
Loading full blame...