Bachelors_thesis_analyses.ipynb 671 KiB
      {
       "cell_type": "markdown",
       "metadata": {
        "toc": true
       },
       "source": [
        "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    
        "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Data-sets\" data-toc-modified-id=\"Data-sets-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Data sets</a></span><ul class=\"toc-item\"><li><span><a href=\"#COMPAS-data\" data-toc-modified-id=\"COMPAS-data-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>COMPAS data</a></span></li><li><span><a href=\"#Synthetic-data\" data-toc-modified-id=\"Synthetic-data-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Synthetic data</a></span></li></ul></li><li><span><a href=\"#Algorithms\" data-toc-modified-id=\"Algorithms-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Algorithms</a></span><ul class=\"toc-item\"><li><span><a href=\"#Contraction-algorithm\" data-toc-modified-id=\"Contraction-algorithm-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Contraction algorithm</a></span></li><li><span><a href=\"#Causal-model\" data-toc-modified-id=\"Causal-model-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>Causal model</a></span></li></ul></li><li><span><a href=\"#Performance-comparison\" data-toc-modified-id=\"Performance-comparison-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Performance comparison</a></span><ul class=\"toc-item\"><li><span><a href=\"#On-synthetic-data\" data-toc-modified-id=\"On-synthetic-data-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>On synthetic data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Predictive-models\" data-toc-modified-id=\"Predictive-models-3.1.1\"><span class=\"toc-item-num\">3.1.1&nbsp;&nbsp;</span>Predictive models</a></span></li><li><span><a href=\"#Visual-comparison\" data-toc-modified-id=\"Visual-comparison-3.1.2\"><span class=\"toc-item-num\">3.1.2&nbsp;&nbsp;</span>Visual comparison</a></span></li></ul></li><li><span><a href=\"#On-COMPAS-data\" data-toc-modified-id=\"On-COMPAS-data-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>On COMPAS data</a></span><ul 
class=\"toc-item\"><li><span><a href=\"#Predictive-models\" data-toc-modified-id=\"Predictive-models-3.2.1\"><span class=\"toc-item-num\">3.2.1&nbsp;&nbsp;</span>Predictive models</a></span></li></ul></li></ul></li></ul></div>"
    
       ]
      },
    
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "# Bachelors thesis' analyses\n",
        "\n",
        "*This Jupyter notebook is for the analyses and model building for Riku Laine's bachelors thesis*\n",
        "\n",
    
        "Table of contents is provided above. First I will briefly present the COMPAS data set and then create the synthetic data set as done by Lakkaraju *et al.* ([link](https://helka.finna.fi/PrimoRecord/pci.acm3098066)). Then I will proceed to implement algorithms. Finally I will do the side-by-side comparisons of the results on the synthetic data. Finally I run the causal model on the COMPAS data.\n",
    
        "## Data sets\n",
        "\n",
        "*Below I load the COMPAS data set and generate the synthetic one.*\n",
    
        "\n",
        "### COMPAS data\n",
    
        "The following data filtering procedure follows the one described in the [ProPublica methodology](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)."
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": 35,
    
       "metadata": {},
       "outputs": [],
       "source": [
    
        "# Imports\n",
    
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from datetime import datetime\n",
        "import matplotlib.pyplot as plt\n",
        "import scipy.stats as scs\n",
    
        "import scipy.integrate as si\n",
    
        "import seaborn as sns\n",
        "import numpy.random as npr\n",
        "from sklearn.preprocessing import OneHotEncoder\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.ensemble import RandomForestClassifier\n",
        "\n",
    
        "# Settings\n",
        "\n",
    
        "%matplotlib inline\n",
        "\n",
        "plt.rcParams.update({'font.size': 16})\n",
    
        "plt.rcParams.update({'figure.figsize': (14, 7)})\n",
        "\n",
        "# Suppress deprecation warnings.\n",
        "\n",
        "import warnings\n",
        "\n",
        "def fxn():\n",
        "    warnings.warn(\"deprecated\", DeprecationWarning)\n",
        "\n",
        "with warnings.catch_warnings():\n",
        "    warnings.simplefilter(\"ignore\")\n",
        "    fxn()"
    
       "execution_count": 36,
    
       "metadata": {},
       "outputs": [
    
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
    
          "(7214, 53)\n",
          "['id' 'name' 'first' 'last' 'compas_screening_date' 'sex' 'dob' 'age'\n",
          " 'age_cat' 'race' 'juv_fel_count' 'decile_score' 'juv_misd_count'\n",
          " 'juv_other_count' 'priors_count' 'days_b_screening_arrest' 'c_jail_in'\n",
          " 'c_jail_out' 'c_case_number' 'c_offense_date' 'c_arrest_date'\n",
          " 'c_days_from_compas' 'c_charge_degree' 'c_charge_desc' 'is_recid'\n",
          " 'r_case_number' 'r_charge_degree' 'r_days_from_arrest' 'r_offense_date'\n",
          " 'r_charge_desc' 'r_jail_in' 'r_jail_out' 'violent_recid'\n",
          " 'is_violent_recid' 'vr_case_number' 'vr_charge_degree' 'vr_offense_date'\n",
          " 'vr_charge_desc' 'type_of_assessment' 'decile_score.1' 'score_text'\n",
          " 'screening_date' 'v_type_of_assessment' 'v_decile_score' 'v_score_text'\n",
          " 'v_screening_date' 'in_custody' 'out_custody' 'priors_count.1' 'start'\n",
          " 'end' 'event' 'two_year_recid']\n"
    
        }
       ],
       "source": [
        "# Read file\n",
        "compas_raw = pd.read_csv(\"../data/compas-scores-two-years.csv\")\n",
        "\n",
        "# Check dimensions, number of rows should be 7214\n",
    
        "print(compas_raw.shape)\n",
        "print(compas_raw.columns.values)"
    
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": 37,
    
       "metadata": {},
       "outputs": [
        {
         "data": {
          "text/plain": [
           "(6172, 13)"
          ]
         },
    
         "execution_count": 37,
    
         "metadata": {},
         "output_type": "execute_result"
        }
       ],
       "source": [
        "# Select columns\n",
    
        "compas = compas_raw[[\n",
        "    'age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex',\n",
        "    'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid',\n",
        "    'two_year_recid', 'c_jail_in', 'c_jail_out'\n",
        "]]\n",
    
        "# Subset values, see reasons in ProPublica methodology.\n",
    
        "compas = compas.query('days_b_screening_arrest <= 30 and \\\n",
        "                      days_b_screening_arrest >= -30 and \\\n",
        "                      is_recid != -1 and \\\n",
        "                      c_charge_degree != \"O\"')\n",
        "\n",
        "# Drop row if score_text is na\n",
        "compas = compas[compas.score_text.notnull()]\n",
        "\n",
        "compas.shape"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": 38,
    
       "outputs": [
        {
         "data": {
          "text/html": [
           "<div>\n",
           "<style scoped>\n",
           "    .dataframe tbody tr th:only-of-type {\n",
           "        vertical-align: middle;\n",
           "    }\n",
           "\n",
           "    .dataframe tbody tr th {\n",
           "        vertical-align: top;\n",
           "    }\n",
           "\n",
           "    .dataframe thead th {\n",
           "        text-align: right;\n",
           "    }\n",
           "</style>\n",
           "<table border=\"1\" class=\"dataframe\">\n",
           "  <thead>\n",
           "    <tr style=\"text-align: right;\">\n",
           "      <th></th>\n",
           "      <th>0</th>\n",
           "      <th>1</th>\n",
           "      <th>2</th>\n",
    
           "    </tr>\n",
           "  </thead>\n",
           "  <tbody>\n",
           "    <tr>\n",
    
           "      <th>id</th>\n",
           "      <td>1</td>\n",
           "      <td>3</td>\n",
           "      <td>4</td>\n",
           "      <td>5</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>name</th>\n",
           "      <td>miguel hernandez</td>\n",
           "      <td>kevon dixon</td>\n",
           "      <td>ed philo</td>\n",
           "      <td>marcu brown</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>first</th>\n",
           "      <td>miguel</td>\n",
           "      <td>kevon</td>\n",
           "      <td>ed</td>\n",
           "      <td>marcu</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>last</th>\n",
           "      <td>hernandez</td>\n",
           "      <td>dixon</td>\n",
           "      <td>philo</td>\n",
           "      <td>brown</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>compas_screening_date</th>\n",
           "      <td>2013-08-14</td>\n",
           "      <td>2013-01-27</td>\n",
           "      <td>2013-04-14</td>\n",
           "      <td>2013-01-13</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>sex</th>\n",
           "      <td>Male</td>\n",
           "      <td>Male</td>\n",
           "      <td>Male</td>\n",
    
           "      <td>Male</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>dob</th>\n",
           "      <td>1947-04-18</td>\n",
           "      <td>1982-01-22</td>\n",
           "      <td>1991-05-14</td>\n",
           "      <td>1993-01-21</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>age</th>\n",
           "      <td>69</td>\n",
           "      <td>34</td>\n",
           "      <td>24</td>\n",
           "      <td>23</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>age_cat</th>\n",
           "      <td>Greater than 45</td>\n",
           "      <td>25 - 45</td>\n",
           "      <td>Less than 25</td>\n",
           "      <td>Less than 25</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>race</th>\n",
           "      <td>Other</td>\n",
           "      <td>African-American</td>\n",
           "      <td>African-American</td>\n",
           "      <td>African-American</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>juv_fel_count</th>\n",
           "      <td>0</td>\n",
    
           "      <td>0</td>\n",
           "      <td>0</td>\n",
    
           "      <td>0</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>decile_score</th>\n",
           "      <td>1</td>\n",
           "      <td>3</td>\n",
           "      <td>4</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>juv_misd_count</th>\n",
           "      <td>0</td>\n",
    
           "      <td>0</td>\n",
    
           "      <td>0</td>\n",
           "      <td>1</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>juv_other_count</th>\n",
           "      <td>0</td>\n",
    
           "      <td>0</td>\n",
           "      <td>1</td>\n",
    
           "      <td>0</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>priors_count</th>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>4</td>\n",
    
           "      <td>1</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>days_b_screening_arrest</th>\n",
           "      <td>-1</td>\n",
           "      <td>-1</td>\n",
           "      <td>-1</td>\n",
           "      <td>NaN</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_jail_in</th>\n",
           "      <td>2013-08-13 06:03:42</td>\n",
           "      <td>2013-01-26 03:45:27</td>\n",
           "      <td>2013-04-13 04:58:34</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_jail_out</th>\n",
           "      <td>2013-08-14 05:41:20</td>\n",
           "      <td>2013-02-05 05:36:53</td>\n",
           "      <td>2013-04-14 07:02:04</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
    
           "      <th>c_case_number</th>\n",
           "      <td>13011352CF10A</td>\n",
           "      <td>13001275CF10A</td>\n",
           "      <td>13005330CF10A</td>\n",
           "      <td>13000570CF10A</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_offense_date</th>\n",
           "      <td>2013-08-13</td>\n",
           "      <td>2013-01-26</td>\n",
           "      <td>2013-04-13</td>\n",
           "      <td>2013-01-12</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_arrest_date</th>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_days_from_compas</th>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_charge_degree</th>\n",
           "      <td>F</td>\n",
           "      <td>F</td>\n",
           "      <td>F</td>\n",
           "      <td>F</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_charge_desc</th>\n",
           "      <td>Aggravated Assault w/Firearm</td>\n",
           "      <td>Felony Battery w/Prior Convict</td>\n",
           "      <td>Possession of Cocaine</td>\n",
           "      <td>Possession of Cannabis</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>is_recid</th>\n",
    
           "      <td>0</td>\n",
           "      <td>1</td>\n",
    
           "      <td>1</td>\n",
    
           "      <td>0</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_case_number</th>\n",
           "      <td>NaN</td>\n",
           "      <td>13009779CF10A</td>\n",
           "      <td>13011511MM10A</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_charge_degree</th>\n",
           "      <td>NaN</td>\n",
           "      <td>(F3)</td>\n",
           "      <td>(M1)</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_days_from_arrest</th>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>0</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_offense_date</th>\n",
           "      <td>NaN</td>\n",
           "      <td>2013-07-05</td>\n",
           "      <td>2013-06-16</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_charge_desc</th>\n",
           "      <td>NaN</td>\n",
           "      <td>Felony Battery (Dom Strang)</td>\n",
           "      <td>Driving Under The Influence</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_jail_in</th>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>2013-06-16</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>r_jail_out</th>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>2013-06-16</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>violent_recid</th>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>is_violent_recid</th>\n",
           "      <td>0</td>\n",
           "      <td>1</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>vr_case_number</th>\n",
           "      <td>NaN</td>\n",
           "      <td>13009779CF10A</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>vr_charge_degree</th>\n",
           "      <td>NaN</td>\n",
           "      <td>(F3)</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>vr_offense_date</th>\n",
           "      <td>NaN</td>\n",
           "      <td>2013-07-05</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>vr_charge_desc</th>\n",
           "      <td>NaN</td>\n",
           "      <td>Felony Battery (Dom Strang)</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>type_of_assessment</th>\n",
           "      <td>Risk of Recidivism</td>\n",
           "      <td>Risk of Recidivism</td>\n",
           "      <td>Risk of Recidivism</td>\n",
           "      <td>Risk of Recidivism</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>decile_score.1</th>\n",
           "      <td>1</td>\n",
           "      <td>3</td>\n",
           "      <td>4</td>\n",
           "      <td>8</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>score_text</th>\n",
           "      <td>Low</td>\n",
           "      <td>Low</td>\n",
           "      <td>Low</td>\n",
           "      <td>High</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>screening_date</th>\n",
           "      <td>2013-08-14</td>\n",
           "      <td>2013-01-27</td>\n",
           "      <td>2013-04-14</td>\n",
           "      <td>2013-01-13</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>v_type_of_assessment</th>\n",
           "      <td>Risk of Violence</td>\n",
           "      <td>Risk of Violence</td>\n",
           "      <td>Risk of Violence</td>\n",
           "      <td>Risk of Violence</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>v_decile_score</th>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "      <td>3</td>\n",
    
           "      <td>6</td>\n",
    
           "    </tr>\n",
           "    <tr>\n",
           "      <th>v_score_text</th>\n",
           "      <td>Low</td>\n",
           "      <td>Low</td>\n",
           "      <td>Low</td>\n",
           "      <td>Medium</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>v_screening_date</th>\n",
           "      <td>2013-08-14</td>\n",
           "      <td>2013-01-27</td>\n",
           "      <td>2013-04-14</td>\n",
           "      <td>2013-01-13</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>in_custody</th>\n",
           "      <td>2014-07-07</td>\n",
           "      <td>2013-01-26</td>\n",
           "      <td>2013-06-16</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>out_custody</th>\n",
           "      <td>2014-07-14</td>\n",
           "      <td>2013-02-05</td>\n",
           "      <td>2013-06-16</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>priors_count.1</th>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>4</td>\n",
           "      <td>1</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>start</th>\n",
           "      <td>0</td>\n",
           "      <td>9</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>end</th>\n",
           "      <td>327</td>\n",
           "      <td>159</td>\n",
           "      <td>63</td>\n",
           "      <td>1174</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>event</th>\n",
           "      <td>0</td>\n",
           "      <td>1</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>two_year_recid</th>\n",
           "      <td>0</td>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "      <td>0</td>\n",
    
           "    </tr>\n",
           "  </tbody>\n",
           "</table>\n",
           "</div>"
          ],
          "text/plain": [
    
           "                                                    0  \\\n",
           "id                                                  1   \n",
           "name                                 miguel hernandez   \n",
           "first                                          miguel   \n",
           "last                                        hernandez   \n",
           "compas_screening_date                      2013-08-14   \n",
           "sex                                              Male   \n",
           "dob                                        1947-04-18   \n",
           "age                                                69   \n",
           "age_cat                               Greater than 45   \n",
           "race                                            Other   \n",
           "juv_fel_count                                       0   \n",
           "decile_score                                        1   \n",
           "juv_misd_count                                      0   \n",
           "juv_other_count                                     0   \n",
           "priors_count                                        0   \n",
           "days_b_screening_arrest                            -1   \n",
           "c_jail_in                         2013-08-13 06:03:42   \n",
           "c_jail_out                        2013-08-14 05:41:20   \n",
           "c_case_number                           13011352CF10A   \n",
           "c_offense_date                             2013-08-13   \n",
           "c_arrest_date                                     NaN   \n",
           "c_days_from_compas                                  1   \n",
           "c_charge_degree                                     F   \n",
           "c_charge_desc            Aggravated Assault w/Firearm   \n",
           "is_recid                                            0   \n",
           "r_case_number                                     NaN   \n",
           "r_charge_degree                                   NaN   \n",
           "r_days_from_arrest                                NaN   \n",
           "r_offense_date                                    NaN   \n",
           "r_charge_desc                                     NaN   \n",
           "r_jail_in                                         NaN   \n",
           "r_jail_out                                        NaN   \n",
           "violent_recid                                     NaN   \n",
           "is_violent_recid                                    0   \n",
           "vr_case_number                                    NaN   \n",
           "vr_charge_degree                                  NaN   \n",
           "vr_offense_date                                   NaN   \n",
           "vr_charge_desc                                    NaN   \n",
           "type_of_assessment                 Risk of Recidivism   \n",
           "decile_score.1                                      1   \n",
           "score_text                                        Low   \n",
           "screening_date                             2013-08-14   \n",
           "v_type_of_assessment                 Risk of Violence   \n",
           "v_decile_score                                      1   \n",
           "v_score_text                                      Low   \n",
           "v_screening_date                           2013-08-14   \n",
           "in_custody                                 2014-07-07   \n",
           "out_custody                                2014-07-14   \n",
           "priors_count.1                                      0   \n",
           "start                                               0   \n",
           "end                                               327   \n",
           "event                                               0   \n",
           "two_year_recid                                      0   \n",
           "\n",
           "                                                      1  \\\n",
           "id                                                    3   \n",
           "name                                        kevon dixon   \n",
           "first                                             kevon   \n",
           "last                                              dixon   \n",
           "compas_screening_date                        2013-01-27   \n",
           "sex                                                Male   \n",
           "dob                                          1982-01-22   \n",
           "age                                                  34   \n",
           "age_cat                                         25 - 45   \n",
           "race                                   African-American   \n",
           "juv_fel_count                                         0   \n",
           "decile_score                                          3   \n",
           "juv_misd_count                                        0   \n",
           "juv_other_count                                       0   \n",
           "priors_count                                          0   \n",
           "days_b_screening_arrest                              -1   \n",
           "c_jail_in                           2013-01-26 03:45:27   \n",
           "c_jail_out                          2013-02-05 05:36:53   \n",
           "c_case_number                             13001275CF10A   \n",
           "c_offense_date                               2013-01-26   \n",
           "c_arrest_date                                       NaN   \n",
           "c_days_from_compas                                    1   \n",
           "c_charge_degree                                       F   \n",
           "c_charge_desc            Felony Battery w/Prior Convict   \n",
           "is_recid                                              1   \n",
           "r_case_number                             13009779CF10A   \n",
           "r_charge_degree                                    (F3)   \n",
           "r_days_from_arrest                                  NaN   \n",
           "r_offense_date                               2013-07-05   \n",
           "r_charge_desc               Felony Battery (Dom Strang)   \n",
           "r_jail_in                                           NaN   \n",
           "r_jail_out                                          NaN   \n",
           "violent_recid                                       NaN   \n",
           "is_violent_recid                                      1   \n",
           "vr_case_number                            13009779CF10A   \n",
           "vr_charge_degree                                   (F3)   \n",
           "vr_offense_date                              2013-07-05   \n",
           "vr_charge_desc              Felony Battery (Dom Strang)   \n",
           "type_of_assessment                   Risk of Recidivism   \n",
           "decile_score.1                                        3   \n",
           "score_text                                          Low   \n",
           "screening_date                               2013-01-27   \n",
           "v_type_of_assessment                   Risk of Violence   \n",
           "v_decile_score                                        1   \n",
           "v_score_text                                        Low   \n",
           "v_screening_date                             2013-01-27   \n",
           "in_custody                                   2013-01-26   \n",
           "out_custody                                  2013-02-05   \n",
           "priors_count.1                                        0   \n",
           "start                                                 9   \n",
           "end                                                 159   \n",
           "event                                                 1   \n",
           "two_year_recid                                        1   \n",
    
           "                                                   2                       3  \n",
           "id                                                 4                       5  \n",
           "name                                        ed philo             marcu brown  \n",
           "first                                             ed                   marcu  \n",
           "last                                           philo                   brown  \n",
           "compas_screening_date                     2013-04-14              2013-01-13  \n",
           "sex                                             Male                    Male  \n",
           "dob                                       1991-05-14              1993-01-21  \n",
           "age                                               24                      23  \n",
           "age_cat                                 Less than 25            Less than 25  \n",
           "race                                African-American        African-American  \n",
           "juv_fel_count                                      0                       0  \n",
           "decile_score                                       4                       8  \n",
           "juv_misd_count                                     0                       1  \n",
           "juv_other_count                                    1                       0  \n",
           "priors_count                                       4                       1  \n",
           "days_b_screening_arrest                           -1                     NaN  \n",
           "c_jail_in                        2013-04-13 04:58:34                     NaN  \n",
           "c_jail_out                       2013-04-14 07:02:04                     NaN  \n",
           "c_case_number                          13005330CF10A           13000570CF10A  \n",
           "c_offense_date                            2013-04-13              2013-01-12  \n",
           "c_arrest_date                                    NaN                     NaN  \n",
           "c_days_from_compas                                 1                       1  \n",
           "c_charge_degree                                    F                       F  \n",
           "c_charge_desc                  Possession of Cocaine  Possession of Cannabis  \n",
           "is_recid                                           1                       0  \n",
           "r_case_number                          13011511MM10A                     NaN  \n",
           "r_charge_degree                                 (M1)                     NaN  \n",
           "r_days_from_arrest                                 0                     NaN  \n",
           "r_offense_date                            2013-06-16                     NaN  \n",
           "r_charge_desc            Driving Under The Influence                     NaN  \n",
           "r_jail_in                                 2013-06-16                     NaN  \n",
           "r_jail_out                                2013-06-16                     NaN  \n",
           "violent_recid                                    NaN                     NaN  \n",
           "is_violent_recid                                   0                       0  \n",
           "vr_case_number                                   NaN                     NaN  \n",
           "vr_charge_degree                                 NaN                     NaN  \n",
           "vr_offense_date                                  NaN                     NaN  \n",
           "vr_charge_desc                                   NaN                     NaN  \n",
           "type_of_assessment                Risk of Recidivism      Risk of Recidivism  \n",
           "decile_score.1                                     4                       8  \n",
           "score_text                                       Low                    High  \n",
           "screening_date                            2013-04-14              2013-01-13  \n",
           "v_type_of_assessment                Risk of Violence        Risk of Violence  \n",
           "v_decile_score                                     3                       6  \n",
           "v_score_text                                     Low                  Medium  \n",
           "v_screening_date                          2013-04-14              2013-01-13  \n",
           "in_custody                                2013-06-16                     NaN  \n",
           "out_custody                               2013-06-16                     NaN  \n",
           "priors_count.1                                     4                       1  \n",
           "start                                              0                       0  \n",
           "end                                               63                    1174  \n",
           "event                                              0                       0  \n",
           "two_year_recid                                     1                       0  "
          ]
         },
         "metadata": {},
         "output_type": "display_data"
    
        "# Calculate length of stay\n",
    
        "out = pd.to_datetime(compas.c_jail_out, format=\"%Y-%m-%d %H:%M:%S\")\n",
        "in_ = pd.to_datetime(compas.c_jail_in, format=\"%Y-%m-%d %H:%M:%S\")\n",
        "\n",
        "compas['length_of_stay'] = (out - in_).astype('timedelta64[D]')\n",
        "\n",
        "# Structure of the data\n",
    
        "display(compas_raw.head(4).T)\n",
    
        "#print(np.sum(compas_raw.c_arrest_date.isnull()))"
       ]
      },
      {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
        "**Columns:**\n",
        "\n",
        "* id = identification number\n",
        "* name \n",
        "* first (name)\n",
        "* last (name)\n",
        "* compas_screening_date = date of COMPAS filling\n",
        "* sex\n",
    
        "* dob = date of birth\n",
    
        "* age\n",
        "* age_cat\n",
        "* race\n",
        "* juv_fel_count = No. of juvenile felonies\n",
        "* decile_score = decile score of COMPAS\n",
        "* juv_misd_count = No. of juvenile misdemeanors\n",
        "* juv_other_count = No. of other crimes juvenile \n",
        "* priors_count = No. of priors \n",
        "* days_b_screening_arrest = date of a defendants Compas scored crime - date for person's arrest (c_offense_date - screening_date) \n",
        "* c_jail_in = jailing date of COMPAS scored crime\n",
        "* c_jail_out = jailing date of COMPAS scored crime\n",
    
        "* c_case_number = case number of COMPAS scored crime\n",
        "* c_offense_date = offense date of COMPAS scored crime\n",
        "* c_arrest_date = arrest date of COMPAS scored crime\n",
        "* c_days_from_compas = \n",
    
        "* c_charge_degree\n",
        "* c_charge_desc\n",
        "* is_recid\n",
        "* r_case_number\n",
        "* r_charge_degree\n",
        "* r_days_from_arrest\n",
        "* r_offense_date\n",
        "* r_charge_desc\n",
        "* r_jail_in\n",
        "* r_jail_out\n",
        "* violent_recid\n",
        "* is_violent_recid\n",
        "* vr_case_number\n",
        "* vr_charge_degree\n",
        "* vr_offense_date\n",
        "* vr_charge_desc\n",
        "* type_of_assessment\n",
        "* decile_score.1\n",
        "* score_text\n",
        "* screening_date\n",
        "* v_type_of_assessment\n",
        "* v_decile_score\n",
        "* v_score_text\n",
        "* v_screening_date\n",
        "* in_custody\n",
        "* out_custody\n",
        "* priors_count.1\n",
        "* start\n",
        "* end\n",
        "* event\n",
    
        "* two_year_recid\n",
        "\n",
        "Let's obtain the basic statistics for each of the variables."
    
       "execution_count": 39,
    
       "metadata": {
        "scrolled": false
       },
       "outputs": [
        {
         "data": {
          "text/html": [
           "<div>\n",
           "<style scoped>\n",
           "    .dataframe tbody tr th:only-of-type {\n",
           "        vertical-align: middle;\n",
           "    }\n",
           "\n",
           "    .dataframe tbody tr th {\n",
           "        vertical-align: top;\n",
           "    }\n",
           "\n",
           "    .dataframe thead th {\n",
           "        text-align: right;\n",
           "    }\n",
           "</style>\n",
           "<table border=\"1\" class=\"dataframe\">\n",
           "  <thead>\n",
           "    <tr style=\"text-align: right;\">\n",
           "      <th></th>\n",
           "      <th>count</th>\n",
           "      <th>unique</th>\n",
           "      <th>top</th>\n",
           "      <th>freq</th>\n",
           "      <th>mean</th>\n",
           "      <th>std</th>\n",
           "      <th>min</th>\n",
           "      <th>25%</th>\n",
           "      <th>50%</th>\n",
           "      <th>75%</th>\n",
           "      <th>max</th>\n",
           "    </tr>\n",
           "  </thead>\n",
           "  <tbody>\n",
           "    <tr>\n",
           "      <th>age</th>\n",
           "      <td>6172</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>34.5345</td>\n",
           "      <td>11.7309</td>\n",
           "      <td>18</td>\n",
           "      <td>25</td>\n",
           "      <td>31</td>\n",
           "      <td>42</td>\n",
           "      <td>96</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_charge_degree</th>\n",
           "      <td>6172</td>\n",
           "      <td>2</td>\n",
           "      <td>F</td>\n",
           "      <td>3970</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>race</th>\n",
           "      <td>6172</td>\n",
           "      <td>6</td>\n",
           "      <td>African-American</td>\n",
           "      <td>3175</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>age_cat</th>\n",
           "      <td>6172</td>\n",
           "      <td>3</td>\n",
           "      <td>25 - 45</td>\n",
           "      <td>3532</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>score_text</th>\n",
           "      <td>6172</td>\n",
           "      <td>3</td>\n",
           "      <td>Low</td>\n",
           "      <td>3421</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>sex</th>\n",
           "      <td>6172</td>\n",
           "      <td>2</td>\n",
           "      <td>Male</td>\n",
           "      <td>4997</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>priors_count</th>\n",
           "      <td>6172</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>3.24644</td>\n",
           "      <td>4.74377</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>1</td>\n",
           "      <td>4</td>\n",
           "      <td>38</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>days_b_screening_arrest</th>\n",
           "      <td>6172</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>-1.74028</td>\n",
           "      <td>5.08471</td>\n",
           "      <td>-30</td>\n",
           "      <td>-1</td>\n",
           "      <td>-1</td>\n",
           "      <td>-1</td>\n",
           "      <td>30</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>decile_score</th>\n",
           "      <td>6172</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>4.4185</td>\n",
           "      <td>2.83946</td>\n",
           "      <td>1</td>\n",
           "      <td>2</td>\n",
           "      <td>4</td>\n",
           "      <td>7</td>\n",
           "      <td>10</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>is_recid</th>\n",
           "      <td>6172</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>0.484446</td>\n",
           "      <td>0.499799</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>two_year_recid</th>\n",
           "      <td>6172</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",
           "      <td>0.45512</td>\n",
           "      <td>0.498022</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>0</td>\n",
           "      <td>1</td>\n",
           "      <td>1</td>\n",
           "    </tr>\n",
           "    <tr>\n",
           "      <th>c_jail_in</th>\n",
           "      <td>6172</td>\n",
           "      <td>6172</td>\n",
    
           "      <td>2014-01-05 10:19:57</td>\n",
    
           "      <td>1</td>\n",
           "      <td>NaN</td>\n",
           "      <td>NaN</td>\n",