Bachelors_thesis_analyses.ipynb 530 KiB
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Data-sets\" data-toc-modified-id=\"Data-sets-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Data sets</a></span><ul class=\"toc-item\"><li><span><a href=\"#COMPAS-data\" data-toc-modified-id=\"COMPAS-data-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>COMPAS data</a></span></li><li><span><a href=\"#Synthetic-data\" data-toc-modified-id=\"Synthetic-data-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Synthetic data</a></span></li></ul></li><li><span><a href=\"#Algorithms\" data-toc-modified-id=\"Algorithms-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Algorithms</a></span><ul class=\"toc-item\"><li><span><a href=\"#Contraction-algorithm\" data-toc-modified-id=\"Contraction-algorithm-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Contraction algorithm</a></span></li><li><span><a href=\"#Causal-model\" data-toc-modified-id=\"Causal-model-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>Causal model</a></span></li></ul></li><li><span><a href=\"#Performance-comparison\" data-toc-modified-id=\"Performance-comparison-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Performance comparison</a></span><ul class=\"toc-item\"><li><span><a href=\"#On-synthetic-data\" data-toc-modified-id=\"On-synthetic-data-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>On synthetic data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Predictive-models\" data-toc-modified-id=\"Predictive-models-3.1.1\"><span class=\"toc-item-num\">3.1.1&nbsp;&nbsp;</span>Predictive models</a></span></li><li><span><a href=\"#Visual-comparison\" data-toc-modified-id=\"Visual-comparison-3.1.2\"><span class=\"toc-item-num\">3.1.2&nbsp;&nbsp;</span>Visual comparison</a></span></li></ul></li><li><span><a href=\"#On-COMPAS-data\" data-toc-modified-id=\"On-COMPAS-data-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>On COMPAS data</a></span><ul class=\"toc-item\"><li><span><a 
href=\"#Predictive-models\" data-toc-modified-id=\"Predictive-models-3.2.1\"><span class=\"toc-item-num\">3.2.1&nbsp;&nbsp;</span>Predictive models</a></span></li></ul></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bachelors thesis' analyses\n",
    "\n",
    "*This Jupyter notebook is for the analyses and model building for Riku Laine's bachelors thesis*\n",
    "\n",
    "Table of contents is provided above. First I will briefly present the COMPAS data set and then create the synthetic data set as done by Lakkaraju *et al.* ([link](https://helka.finna.fi/PrimoRecord/pci.acm3098066)). Then I will proceed to implement algorithms. Finally I will do the side-by-side comparisons of the results on the synthetic data. Finally I run the causal model on the COMPAS data.\n",
    "## Data sets\n",
    "\n",
    "*Below I load the COMPAS data set and generate the synthetic one.*\n",
    "\n",
    "### COMPAS data\n",
    "The following data filtering procedure follows the one described in the [ProPublica methodology](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(7214, 53)\n",
      "['id' 'name' 'first' 'last' 'compas_screening_date' 'sex' 'dob' 'age'\n",
      " 'age_cat' 'race' 'juv_fel_count' 'decile_score' 'juv_misd_count'\n",
      " 'juv_other_count' 'priors_count' 'days_b_screening_arrest' 'c_jail_in'\n",
      " 'c_jail_out' 'c_case_number' 'c_offense_date' 'c_arrest_date'\n",
      " 'c_days_from_compas' 'c_charge_degree' 'c_charge_desc' 'is_recid'\n",
      " 'r_case_number' 'r_charge_degree' 'r_days_from_arrest' 'r_offense_date'\n",
      " 'r_charge_desc' 'r_jail_in' 'r_jail_out' 'violent_recid'\n",
      " 'is_violent_recid' 'vr_case_number' 'vr_charge_degree' 'vr_offense_date'\n",
      " 'vr_charge_desc' 'type_of_assessment' 'decile_score.1' 'score_text'\n",
      " 'screening_date' 'v_type_of_assessment' 'v_decile_score' 'v_score_text'\n",
      " 'v_screening_date' 'in_custody' 'out_custody' 'priors_count.1' 'start'\n",
      " 'end' 'event' 'two_year_recid']\n"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from datetime import datetime\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as scs\n",
    "import seaborn as sns\n",
    "import numpy.random as npr\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "%matplotlib inline\n",
    "plt.rcParams.update({'font.size': 16})\n",
    "plt.rcParams.update({'figure.figsize': (14, 7)})\n",
    "\n",
    "# Read file\n",
    "compas_raw = pd.read_csv(\"../data/compas-scores-two-years.csv\")\n",
    "\n",
    "# Check dimensions, number of rows should be 7214\n",
    "print(compas_raw.shape)\n",
    "print(compas_raw.columns.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6172, 13)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select columns\n",
    "compas = compas_raw[[\n",
    "    'age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex',\n",
    "    'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid',\n",
    "    'two_year_recid', 'c_jail_in', 'c_jail_out'\n",
    "]]\n",
    "# Subset values, see reasons in ProPublica methodology.\n",
    "compas = compas.query('days_b_screening_arrest <= 30 and \\\n",
    "                      days_b_screening_arrest >= -30 and \\\n",
    "                      is_recid != -1 and \\\n",
    "                      c_charge_degree != \"O\"')\n",
    "\n",
    "# Drop row if score_text is na\n",
    "compas = compas[compas.score_text.notnull()]\n",
    "\n",
    "compas.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>age</th>\n",
       "      <td>69</td>\n",
       "      <td>34</td>\n",
       "      <td>24</td>\n",
       "      <td>44</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c_charge_degree</th>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>M</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>race</th>\n",
       "      <td>Other</td>\n",
       "      <td>African-American</td>\n",
       "      <td>African-American</td>\n",
       "      <td>Other</td>\n",
       "      <td>Caucasian</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>age_cat</th>\n",
       "      <td>Greater than 45</td>\n",
       "      <td>25 - 45</td>\n",
       "      <td>Less than 25</td>\n",
       "      <td>25 - 45</td>\n",
       "      <td>25 - 45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>score_text</th>\n",
       "      <td>Low</td>\n",
       "      <td>Low</td>\n",
       "      <td>Low</td>\n",
       "      <td>Low</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sex</th>\n",
       "      <td>Male</td>\n",
       "      <td>Male</td>\n",
       "      <td>Male</td>\n",
       "      <td>Male</td>\n",
       "      <td>Male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>priors_count</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>days_b_screening_arrest</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>decile_score</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>is_recid</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two_year_recid</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c_jail_in</th>\n",
       "      <td>2013-08-13 06:03:42</td>\n",
       "      <td>2013-01-26 03:45:27</td>\n",
       "      <td>2013-04-13 04:58:34</td>\n",
       "      <td>2013-11-30 04:50:18</td>\n",
       "      <td>2014-02-18 05:08:24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c_jail_out</th>\n",
       "      <td>2013-08-14 05:41:20</td>\n",
       "      <td>2013-02-05 05:36:53</td>\n",
       "      <td>2013-04-14 07:02:04</td>\n",
       "      <td>2013-12-01 12:28:56</td>\n",
       "      <td>2014-02-24 12:18:30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>length_of_stay</th>\n",
       "      <td>0</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           0                    1  \\\n",
       "age                                       69                   34   \n",
       "c_charge_degree                            F                    F   \n",
       "race                                   Other     African-American   \n",
       "age_cat                      Greater than 45              25 - 45   \n",
       "score_text                               Low                  Low   \n",
       "sex                                     Male                 Male   \n",
       "priors_count                               0                    0   \n",
       "days_b_screening_arrest                   -1                   -1   \n",
       "decile_score                               1                    3   \n",
       "is_recid                                   0                    1   \n",
       "two_year_recid                             0                    1   \n",
       "c_jail_in                2013-08-13 06:03:42  2013-01-26 03:45:27   \n",
       "c_jail_out               2013-08-14 05:41:20  2013-02-05 05:36:53   \n",
       "length_of_stay                             0                   10   \n",
       "\n",