{
"cell_type": "markdown",
"metadata": {
"toc": true
},
"source": [
"<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
"<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Data-sets\" data-toc-modified-id=\"Data-sets-1\"><span class=\"toc-item-num\">1 </span>Data sets</a></span><ul class=\"toc-item\"><li><span><a href=\"#COMPAS-data\" data-toc-modified-id=\"COMPAS-data-1.1\"><span class=\"toc-item-num\">1.1 </span>COMPAS data</a></span></li><li><span><a href=\"#Synthetic-data\" data-toc-modified-id=\"Synthetic-data-1.2\"><span class=\"toc-item-num\">1.2 </span>Synthetic data</a></span></li></ul></li><li><span><a href=\"#Algorithms\" data-toc-modified-id=\"Algorithms-2\"><span class=\"toc-item-num\">2 </span>Algorithms</a></span><ul class=\"toc-item\"><li><span><a href=\"#Contraction-algorithm\" data-toc-modified-id=\"Contraction-algorithm-2.1\"><span class=\"toc-item-num\">2.1 </span>Contraction algorithm</a></span></li><li><span><a href=\"#Causal-model\" data-toc-modified-id=\"Causal-model-2.2\"><span class=\"toc-item-num\">2.2 </span>Causal model</a></span></li></ul></li><li><span><a href=\"#Performance-comparison\" data-toc-modified-id=\"Performance-comparison-3\"><span class=\"toc-item-num\">3 </span>Performance comparison</a></span><ul class=\"toc-item\"><li><span><a href=\"#On-synthetic-data\" data-toc-modified-id=\"On-synthetic-data-3.1\"><span class=\"toc-item-num\">3.1 </span>On synthetic data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Predictive-models\" data-toc-modified-id=\"Predictive-models-3.1.1\"><span class=\"toc-item-num\">3.1.1 </span>Predictive models</a></span></li><li><span><a href=\"#Visual-comparison\" data-toc-modified-id=\"Visual-comparison-3.1.2\"><span class=\"toc-item-num\">3.1.2 </span>Visual comparison</a></span></li></ul></li><li><span><a href=\"#On-COMPAS-data\" data-toc-modified-id=\"On-COMPAS-data-3.2\"><span class=\"toc-item-num\">3.2 </span>On COMPAS data</a></span><ul class=\"toc-item\"><li><span><a href=\"#Predictive-models\" data-toc-modified-id=\"Predictive-models-3.2.1\"><span class=\"toc-item-num\">3.2.1 
</span>Predictive models</a></span></li></ul></li></ul></li></ul></div>"
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bachelors thesis' analyses\n",
"\n",
"*This Jupyter notebook is for the analyses and model building for Riku Laine's bachelors thesis*\n",
"\n",
"Table of contents is provided above. First I will briefly present the COMPAS data set and then create the synthetic data set as done by Lakkaraju *et al.* ([link](https://helka.finna.fi/PrimoRecord/pci.acm3098066)). Then I will proceed to implement algorithms. Finally I will do the side-by-side comparisons of the results on the synthetic data. Finally I run the causal model on the COMPAS data.\n",
"## Data sets\n",
"\n",
"*Below I load the COMPAS data set and generate the synthetic one.*\n",
"\n",
"### COMPAS data\n",
"The following data filtering procedure follows the one described in the [ProPublica methodology](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)."
]
},
{
"cell_type": "code",
{
"name": "stdout",
"output_type": "stream",
"text": [
"(7214, 53)\n",
"['id' 'name' 'first' 'last' 'compas_screening_date' 'sex' 'dob' 'age'\n",
" 'age_cat' 'race' 'juv_fel_count' 'decile_score' 'juv_misd_count'\n",
" 'juv_other_count' 'priors_count' 'days_b_screening_arrest' 'c_jail_in'\n",
" 'c_jail_out' 'c_case_number' 'c_offense_date' 'c_arrest_date'\n",
" 'c_days_from_compas' 'c_charge_degree' 'c_charge_desc' 'is_recid'\n",
" 'r_case_number' 'r_charge_degree' 'r_days_from_arrest' 'r_offense_date'\n",
" 'r_charge_desc' 'r_jail_in' 'r_jail_out' 'violent_recid'\n",
" 'is_violent_recid' 'vr_case_number' 'vr_charge_degree' 'vr_offense_date'\n",
" 'vr_charge_desc' 'type_of_assessment' 'decile_score.1' 'score_text'\n",
" 'screening_date' 'v_type_of_assessment' 'v_decile_score' 'v_score_text'\n",
" 'v_screening_date' 'in_custody' 'out_custody' 'priors_count.1' 'start'\n",
" 'end' 'event' 'two_year_recid']\n"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"plt.rcParams.update({'font.size': 16})\n",
"plt.rcParams.update({'figure.figsize': (14, 7)})\n",
"\n",
"# Read file\n",
"compas_raw = pd.read_csv(\"../data/compas-scores-two-years.csv\")\n",
"\n",
"# Check dimensions, number of rows should be 7214\n",
"print(compas_raw.shape)\n",
"print(compas_raw.columns.values)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(6172, 13)"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select columns\n",
"compas = compas_raw[[\n",
" 'age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex',\n",
" 'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid',\n",
" 'two_year_recid', 'c_jail_in', 'c_jail_out'\n",
"]]\n",
"# Subset values, see reasons in ProPublica methodology.\n",
"compas = compas.query('days_b_screening_arrest <= 30 and \\\n",
" days_b_screening_arrest >= -30 and \\\n",
" is_recid != -1 and \\\n",
" c_charge_degree != \"O\"')\n",
"\n",
"# Drop row if score_text is na\n",
"compas = compas[compas.score_text.notnull()]\n",
"\n",
"compas.shape"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>age</th>\n",
" <td>69</td>\n",
" <td>34</td>\n",
" <td>24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c_charge_degree</th>\n",
" <td>F</td>\n",
" <td>F</td>\n",
" <td>F</td>\n",
" </tr>\n",
" <tr>\n",
" <th>race</th>\n",
" <td>Other</td>\n",
" <td>African-American</td>\n",
" <td>African-American</td>\n",
" <td>Other</td>\n",
" <td>Caucasian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>age_cat</th>\n",
" <td>Greater than 45</td>\n",
" <td>25 - 45</td>\n",
" <td>Less than 25</td>\n",
" <td>25 - 45</td>\n",
" <td>25 - 45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>score_text</th>\n",
" <td>Low</td>\n",
" <td>Low</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>sex</th>\n",
" <td>Male</td>\n",
" <td>Male</td>\n",
" <td>Male</td>\n",
" <th>priors_count</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>days_b_screening_arrest</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>decile_score</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>is_recid</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>two_year_recid</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c_jail_in</th>\n",
" <td>2013-08-13 06:03:42</td>\n",
" <td>2013-01-26 03:45:27</td>\n",
" <td>2013-04-13 04:58:34</td>\n",
" <td>2013-11-30 04:50:18</td>\n",
" <td>2014-02-18 05:08:24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>c_jail_out</th>\n",
" <td>2013-08-14 05:41:20</td>\n",
" <td>2013-02-05 05:36:53</td>\n",
" <td>2013-04-14 07:02:04</td>\n",
" <td>2013-12-01 12:28:56</td>\n",
" <td>2014-02-24 12:18:30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>length_of_stay</th>\n",
" <td>0</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"age 69 34 \n",
"c_charge_degree F F \n",
"race Other African-American \n",
"age_cat Greater than 45 25 - 45 \n",
"score_text Low Low \n",
"sex Male Male \n",
"priors_count 0 0 \n",
"days_b_screening_arrest -1 -1 \n",
"decile_score 1 3 \n",
"is_recid 0 1 \n",
"two_year_recid 0 1 \n",
"c_jail_in 2013-08-13 06:03:42 2013-01-26 03:45:27 \n",
"c_jail_out 2013-08-14 05:41:20 2013-02-05 05:36:53 \n",
"length_of_stay 0 10 \n",
"\n",