MachineLearning_inMolecularBiology2022_GroupProject.ipynb 865 KiB
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project Tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the first few assignments, we learned how to infer parts-based components (known as mutational signatures) generated by particular mutational processes using Non-negative Matrix Factorization (NMF). In doing so, we try to reconstruct the mutation catalog of a given sample from the mutational signatures and their contributions.\n",
    "\n",
    "In this group project, you will use similar mutational profiles and signature activities to predict cancer types, but with a much larger sample size.\n",
    "You should:\n",
    "* Separate the data into training and test groups within each cancer type.\n",
    "* Find out which features are informative for predicting the cancer type (label). Try both the combination of profiles and activities and each data type on its own.\n",
    "* Implement different models for classifying the samples given the input data, and evaluate model performance on the test data to check for overfitting. Briefly explain how each model you have used works.\n",
    "* Report model performance using standard machine learning metrics, such as confusion matrices.\n",
    "* Compare model performance across methods and across cancer types: are some types easier to predict than others?\n",
    "* Submit a single Jupyter notebook as the final report and present it during the last assignment session."
   ]
  },
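  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the first task (splitting within each cancer type), assuming scikit-learn is available. The toy table and the names `samples`, `train` and `test` are made up for illustration and are not part of the project data; `stratify` keeps the share of each label the same in both groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Toy sample table (made-up names and labels), one row per sample.\n",
    "samples = pd.DataFrame({\n",
    "    'sample': ['s%d' % i for i in range(8)],\n",
    "    'label': ['TypeA'] * 4 + ['TypeB'] * 4,\n",
    "})\n",
    "\n",
    "# stratify=... splits within each label, so every cancer type\n",
    "# appears in both the training and the test group.\n",
    "train, test = train_test_split(\n",
    "    samples, test_size=0.25, stratify=samples['label'], random_state=0\n",
    ")\n",
    "test['label'].value_counts()"
   ]
  },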
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data include both mutational catalogs from multiple cancers and the predicted signature activities from the paper [\"Alexandrov LB, et al. (2020) The repertoire of mutational signatures in human cancer\"](https://www.nature.com/articles/s41586-020-1943-3). The data are generated either from whole human genomes (WGS) or from exome regions only (WES). Since the exome constitutes only about 1% of the human genome, the total mutation counts in WES samples are much smaller. So if you plan to use WGS together with WES data, remember to normalize the profile of each sample so that it sums to 1.\n",
    "\n",
    "Note that the data are generated on different platforms by different research groups; some of them (e.g. those labeled PCAWG or TCGA) are processed with the same bioinformatics pipeline. Thus, these samples will have less variability related to data processing pipelines."
   ]
  },
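  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the per-sample normalization mentioned above, on a made-up catalog in the same layout as the project files (two annotation columns followed by one `CancerType::Sample` column per sample). The helper `normalize_catalog` and the toy values are hypothetical, for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy catalog (made-up counts) in the same layout as the project files.\n",
    "toy = pd.DataFrame({\n",
    "    'Mutation type': ['C>A', 'C>A', 'C>G'],\n",
    "    'Trinucleotide': ['ACA', 'ACC', 'ACA'],\n",
    "    'TypeA::s1': [6, 3, 1],\n",
    "    'TypeB::s2': [0, 2, 2],\n",
    "})\n",
    "\n",
    "def normalize_catalog(catalog):\n",
    "    # Keep only the count columns; the mutation context becomes the index.\n",
    "    counts = catalog.set_index(['Mutation type', 'Trinucleotide'])\n",
    "    totals = counts.sum(axis=0)\n",
    "    # Divide each sample column by its total; guard against empty samples.\n",
    "    return counts.div(totals.where(totals > 0, 1), axis=1)\n",
    "\n",
    "profiles = normalize_catalog(toy)\n",
    "# The cancer type label is the part of the column name before '::'.\n",
    "labels = profiles.columns.str.split('::').str[0]\n",
    "profiles.sum(axis=0)"
   ]
  },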
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some cancer types are labeled under the same tissue, e.g. 'Bone-Benign' and 'Bone-Epith'. You can either combine such types or keep only the one that has more samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a link to background reading: [\"Pan-Cancer Analysis of Whole Genomes\"](https://www.nature.com/collections/afdejfafdb). Have a look especially at the paper [\"A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns\"](https://www.nature.com/articles/s41467-019-13825-8)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mutational catalogs and activities - WGS data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mutation type</th>\n",
       "      <th>Trinucleotide</th>\n",
       "      <th>Biliary-AdenoCA::SP117655</th>\n",
       "      <th>Biliary-AdenoCA::SP117556</th>\n",
       "      <th>Biliary-AdenoCA::SP117627</th>\n",
       "      <th>Biliary-AdenoCA::SP117775</th>\n",
       "      <th>Biliary-AdenoCA::SP117332</th>\n",
       "      <th>Biliary-AdenoCA::SP117712</th>\n",
       "      <th>Biliary-AdenoCA::SP117017</th>\n",
       "      <th>Biliary-AdenoCA::SP117031</th>\n",
       "      <th>...</th>\n",
       "      <th>Uterus-AdenoCA::SP94540</th>\n",
       "      <th>Uterus-AdenoCA::SP95222</th>\n",
       "      <th>Uterus-AdenoCA::SP89389</th>\n",
       "      <th>Uterus-AdenoCA::SP90503</th>\n",
       "      <th>Uterus-AdenoCA::SP92460</th>\n",
       "      <th>Uterus-AdenoCA::SP92931</th>\n",
       "      <th>Uterus-AdenoCA::SP91265</th>\n",
       "      <th>Uterus-AdenoCA::SP89909</th>\n",
       "      <th>Uterus-AdenoCA::SP90629</th>\n",
       "      <th>Uterus-AdenoCA::SP95550</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACA</td>\n",
       "      <td>269</td>\n",
       "      <td>114</td>\n",
       "      <td>105</td>\n",
       "      <td>217</td>\n",
       "      <td>52</td>\n",
       "      <td>192</td>\n",
       "      <td>54</td>\n",
       "      <td>196</td>\n",
       "      <td>...</td>\n",
       "      <td>117</td>\n",
       "      <td>233</td>\n",
       "      <td>94</td>\n",
       "      <td>114</td>\n",
       "      <td>257</td>\n",
       "      <td>139</td>\n",
       "      <td>404</td>\n",
       "      <td>97</td>\n",
       "      <td>250</td>\n",
       "      <td>170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACC</td>\n",
       "      <td>148</td>\n",
       "      <td>56</td>\n",
       "      <td>71</td>\n",
       "      <td>123</td>\n",
       "      <td>36</td>\n",
       "      <td>139</td>\n",
       "      <td>54</td>\n",
       "      <td>102</td>\n",
       "      <td>...</td>\n",
       "      <td>90</td>\n",
       "      <td>167</td>\n",
       "      <td>59</td>\n",
       "      <td>64</td>\n",
       "      <td>268</td>\n",
       "      <td>75</td>\n",
       "      <td>255</td>\n",
       "      <td>78</td>\n",
       "      <td>188</td>\n",
       "      <td>137</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 2782 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Mutation type Trinucleotide  Biliary-AdenoCA::SP117655  \\\n",
       "0           C>A           ACA                        269   \n",
       "1           C>A           ACC                        148   \n",
       "\n",
       "   Biliary-AdenoCA::SP117556  Biliary-AdenoCA::SP117627  \\\n",
       "0                        114                        105   \n",
       "1                         56                         71   \n",
       "\n",
       "   Biliary-AdenoCA::SP117775  Biliary-AdenoCA::SP117332  \\\n",
       "0                        217                         52   \n",
       "1                        123                         36   \n",
       "\n",
       "   Biliary-AdenoCA::SP117712  Biliary-AdenoCA::SP117017  \\\n",
       "0                        192                         54   \n",
       "1                        139                         54   \n",
       "\n",
       "   Biliary-AdenoCA::SP117031  ...  Uterus-AdenoCA::SP94540  \\\n",
       "0                        196  ...                      117   \n",
       "1                        102  ...                       90   \n",
       "\n",
       "   Uterus-AdenoCA::SP95222  Uterus-AdenoCA::SP89389  Uterus-AdenoCA::SP90503  \\\n",
       "0                      233                       94                      114   \n",
       "1                      167                       59                       64   \n",
       "\n",
       "   Uterus-AdenoCA::SP92460  Uterus-AdenoCA::SP92931  Uterus-AdenoCA::SP91265  \\\n",
       "0                      257                      139                      404   \n",
       "1                      268                       75                      255   \n",
       "\n",
       "   Uterus-AdenoCA::SP89909  Uterus-AdenoCA::SP90629  Uterus-AdenoCA::SP95550  \n",
       "0                       97                      250                      170  \n",
       "1                       78                      188                      137  \n",
       "\n",
       "[2 rows x 2782 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## PCAWG data are processed with the same pipeline\n",
    "PCAWG_wgs_mut = pd.read_csv (\"./project_data/catalogs/WGS/WGS_PCAWG.96.csv\")\n",
    "PCAWG_wgs_mut.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy is the cosine similarity between the reconstructed catalog and the observed catalog."
   ]
  },
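  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of such a cosine similarity, using NumPy on made-up vectors. The helper `cosine_similarity` is hypothetical and is not the code that produced the `Accuracy` column; it only illustrates the definition, dot(a, b) / (|a| |b|)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def cosine_similarity(observed, reconstructed):\n",
    "    observed = np.asarray(observed, dtype=float)\n",
    "    reconstructed = np.asarray(reconstructed, dtype=float)\n",
    "    denom = np.linalg.norm(observed) * np.linalg.norm(reconstructed)\n",
    "    # Return 0.0 when either vector is all zeros (similarity is undefined).\n",
    "    return float(observed @ reconstructed / denom) if denom > 0 else 0.0\n",
    "\n",
    "observed = [269, 148, 95]        # made-up observed counts\n",
    "reconstructed = [260, 150, 100]  # made-up reconstructed counts\n",
    "cosine_similarity(observed, reconstructed)"
   ]
  },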
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Cancer Types</th>\n",
       "      <th>Sample Names</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>SBS1</th>\n",
       "      <th>SBS2</th>\n",
       "      <th>SBS3</th>\n",
       "      <th>SBS4</th>\n",
       "      <th>SBS5</th>\n",
       "      <th>SBS6</th>\n",
       "      <th>SBS7a</th>\n",
       "      <th>...</th>\n",
       "      <th>SBS51</th>\n",
       "      <th>SBS52</th>\n",
       "      <th>SBS53</th>\n",
       "      <th>SBS54</th>\n",
       "      <th>SBS55</th>\n",
       "      <th>SBS56</th>\n",
       "      <th>SBS57</th>\n",
       "      <th>SBS58</th>\n",
       "      <th>SBS59</th>\n",
       "      <th>SBS60</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Biliary-AdenoCA</td>\n",
       "      <td>SP117655</td>\n",
       "      <td>0.968</td>\n",
       "      <td>1496</td>\n",
       "      <td>1296</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1825</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Biliary-AdenoCA</td>\n",
       "      <td>SP117556</td>\n",
       "      <td>0.963</td>\n",
       "      <td>985</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>922</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 68 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      Cancer Types Sample Names  Accuracy  SBS1  SBS2  SBS3  SBS4  SBS5  SBS6  \\\n",
       "0  Biliary-AdenoCA     SP117655     0.968  1496  1296     0     0  1825     0   \n",
       "1  Biliary-AdenoCA     SP117556     0.963   985     0     0     0   922     0   \n",
       "\n",
       "   SBS7a  ...  SBS51  SBS52  SBS53  SBS54  SBS55  SBS56  SBS57  SBS58  SBS59  \\\n",
       "0      0  ...      0      0      0      0      0      0      0      0      0   \n",
       "1      0  ...      0      0      0      0      0      0      0      0      0   \n",
       "\n",
       "   SBS60  \n",
       "0      0  \n",
       "1      0  \n",
       "\n",
       "[2 rows x 68 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Activities:\n",
    "PCAWG_wgs_act = pd.read_csv (\"./project_data/activities/WGS/WGS_PCAWG.activities.csv\")\n",
    "PCAWG_wgs_act.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mutation type</th>\n",
       "      <th>Trinucleotide</th>\n",
       "      <th>ALL::PD4020a</th>\n",
       "      <th>ALL::SJBALL011_D</th>\n",
       "      <th>ALL::SJBALL012_D</th>\n",
       "      <th>ALL::SJBALL020013_D1</th>\n",
       "      <th>ALL::SJBALL020422_D1</th>\n",
       "      <th>ALL::SJBALL020579_D1</th>\n",
       "      <th>ALL::SJBALL020589_D1</th>\n",
       "      <th>ALL::SJBALL020625_D1</th>\n",
       "      <th>...</th>\n",
       "      <th>Stomach-AdenoCa::pfg316T</th>\n",
       "      <th>Stomach-AdenoCa::pfg317T</th>\n",
       "      <th>Stomach-AdenoCa::pfg344T</th>\n",
       "      <th>Stomach-AdenoCa::pfg373T</th>\n",
       "      <th>Stomach-AdenoCa::pfg375T</th>\n",
       "      <th>Stomach-AdenoCa::pfg378T</th>\n",
       "      <th>Stomach-AdenoCa::pfg398T</th>\n",
       "      <th>Stomach-AdenoCa::pfg413T</th>\n",
       "      <th>Stomach-AdenoCa::pfg416T</th>\n",
       "      <th>Stomach-AdenoCa::pfg424T</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACA</td>\n",
       "      <td>35</td>\n",
       "      <td>9</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>5</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>...</td>\n",
       "      <td>133</td>\n",
       "      <td>185</td>\n",
       "      <td>202</td>\n",
       "      <td>185</td>\n",
       "      <td>96</td>\n",
       "      <td>134</td>\n",
       "      <td>12</td>\n",
       "      <td>279</td>\n",
       "      <td>75</td>\n",
       "      <td>135</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACC</td>\n",
       "      <td>16</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>10</td>\n",
       "      <td>5</td>\n",
       "      <td>9</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>48</td>\n",
       "      <td>70</td>\n",
       "      <td>126</td>\n",
       "      <td>88</td>\n",
       "      <td>35</td>\n",
       "      <td>54</td>\n",
       "      <td>16</td>\n",
       "      <td>112</td>\n",
       "      <td>31</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 1867 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Mutation type Trinucleotide  ALL::PD4020a  ALL::SJBALL011_D  \\\n",
       "0           C>A           ACA            35                 9   \n",
       "1           C>A           ACC            16                 2   \n",
       "\n",
       "   ALL::SJBALL012_D  ALL::SJBALL020013_D1  ALL::SJBALL020422_D1  \\\n",
       "0                 2                     7                     5   \n",
       "1                 4                    10                     5   \n",
       "\n",
       "   ALL::SJBALL020579_D1  ALL::SJBALL020589_D1  ALL::SJBALL020625_D1  ...  \\\n",
       "0                     7                     3                     5  ...   \n",
       "1                     9                     1                     2  ...   \n",
       "\n",
       "   Stomach-AdenoCa::pfg316T  Stomach-AdenoCa::pfg317T  \\\n",
       "0                       133                       185   \n",
       "1                        48                        70   \n",
       "\n",
       "   Stomach-AdenoCa::pfg344T  Stomach-AdenoCa::pfg373T  \\\n",
       "0                       202                       185   \n",
       "1                       126                        88   \n",
       "\n",
       "   Stomach-AdenoCa::pfg375T  Stomach-AdenoCa::pfg378T  \\\n",
       "0                        96                       134   \n",
       "1                        35                        54   \n",
       "\n",
       "   Stomach-AdenoCa::pfg398T  Stomach-AdenoCa::pfg413T  \\\n",
       "0                        12                       279   \n",
       "1                        16                       112   \n",
       "\n",
       "   Stomach-AdenoCa::pfg416T  Stomach-AdenoCa::pfg424T  \n",
       "0                        75                       135  \n",
       "1                        31                        91  \n",
       "\n",
       "[2 rows x 1867 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nonPCAWG_wgs_mut = pd.read_csv (\"./project_data/catalogs/WGS/WGS_Other.96.csv\")\n",
    "nonPCAWG_wgs_mut.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Cancer Types</th>\n",
       "      <th>Sample Names</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>SBS1</th>\n",
       "      <th>SBS2</th>\n",
       "      <th>SBS3</th>\n",
       "      <th>SBS4</th>\n",
       "      <th>SBS5</th>\n",
       "      <th>SBS6</th>\n",
       "      <th>SBS7a</th>\n",
       "      <th>...</th>\n",
       "      <th>SBS51</th>\n",
       "      <th>SBS52</th>\n",
       "      <th>SBS53</th>\n",
       "      <th>SBS54</th>\n",
       "      <th>SBS55</th>\n",
       "      <th>SBS56</th>\n",
       "      <th>SBS57</th>\n",
       "      <th>SBS58</th>\n",
       "      <th>SBS59</th>\n",
       "      <th>SBS60</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ALL</td>\n",
       "      <td>PD4020a</td>\n",
       "      <td>0.995</td>\n",
       "      <td>208</td>\n",
       "      <td>3006</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>365</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ALL</td>\n",
       "      <td>SJBALL011_D</td>\n",
       "      <td>0.905</td>\n",
       "      <td>66</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>144</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 68 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Cancer Types Sample Names  Accuracy  SBS1  SBS2  SBS3  SBS4  SBS5  SBS6  \\\n",
       "0          ALL      PD4020a     0.995   208  3006     0     0   365     0   \n",
       "1          ALL  SJBALL011_D     0.905    66     0     0     0   144     0   \n",
       "\n",
       "   SBS7a  ...  SBS51  SBS52  SBS53  SBS54  SBS55  SBS56  SBS57  SBS58  SBS59  \\\n",
       "0      0  ...      0      0      0      0      0      0      0      0      0   \n",
       "1      0  ...      0      0      0      0      0      0      0      0      0   \n",
       "\n",
       "   SBS60  \n",
       "0      0  \n",
       "1      0  \n",
       "\n",
       "[2 rows x 68 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nonPCAWG_wgs_act = pd.read_csv (\"./project_data/activities/WGS/WGS_Other.activities.csv\")\n",
    "nonPCAWG_wgs_act.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mutational catalogs - WES data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mutation type</th>\n",
       "      <th>Trinucleotide</th>\n",
       "      <th>AML::TCGA-AB-2802-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2803-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2804-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2805-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2806-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2807-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2808-03B-01W-0728-08</th>\n",
       "      <th>AML::TCGA-AB-2809-03D-01W-0755-09</th>\n",
       "      <th>...</th>\n",
       "      <th>Eye-Melanoma::TCGA-WC-A885-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-WC-A888-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-WC-A88A-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-WC-AA9A-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-WC-AA9E-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-YZ-A980-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-YZ-A982-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-YZ-A983-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-YZ-A984-01A-11D-A39W-08</th>\n",
       "      <th>Eye-Melanoma::TCGA-YZ-A985-01A-11D-A39W-08</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACC</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 9495 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Mutation type Trinucleotide  AML::TCGA-AB-2802-03B-01W-0728-08  \\\n",
       "0           C>A           ACA                                  0   \n",
       "1           C>A           ACC                                  0   \n",
       "\n",
       "   AML::TCGA-AB-2803-03B-01W-0728-08  AML::TCGA-AB-2804-03B-01W-0728-08  \\\n",
       "0                                  0                                  0   \n",
       "1                                  2                                  0   \n",
       "\n",
       "   AML::TCGA-AB-2805-03B-01W-0728-08  AML::TCGA-AB-2806-03B-01W-0728-08  \\\n",
       "0                                  0                                  4   \n",
       "1                                  0                                  0   \n",
       "\n",
       "   AML::TCGA-AB-2807-03B-01W-0728-08  AML::TCGA-AB-2808-03B-01W-0728-08  \\\n",
       "0                                  0                                  2   \n",
       "1                                  1                                  3   \n",
       "\n",
       "   AML::TCGA-AB-2809-03D-01W-0755-09  ...  \\\n",
       "0                                  0  ...   \n",
       "1                                  0  ...   \n",
       "\n",
       "   Eye-Melanoma::TCGA-WC-A885-01A-11D-A39W-08  \\\n",
       "0                                           1   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-WC-A888-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-WC-A88A-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-WC-AA9A-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-WC-AA9E-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-YZ-A980-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-YZ-A982-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-YZ-A983-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           1   \n",
       "\n",
       "   Eye-Melanoma::TCGA-YZ-A984-01A-11D-A39W-08  \\\n",
       "0                                           0   \n",
       "1                                           0   \n",
       "\n",
       "   Eye-Melanoma::TCGA-YZ-A985-01A-11D-A39W-08  \n",
       "0                                           0  \n",
       "1                                           0  \n",
       "\n",
       "[2 rows x 9495 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Processed with the TCGA pipeline\n",
    "TCGA_wes_mut = pd.read_csv (\"./project_data/catalogs/WES/WES_TCGA.96.csv\")\n",
    "TCGA_wes_mut.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Cancer Types</th>\n",
       "      <th>Sample Names</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>SBS1</th>\n",
       "      <th>SBS2</th>\n",
       "      <th>SBS3</th>\n",
       "      <th>SBS4</th>\n",
       "      <th>SBS5</th>\n",
       "      <th>SBS6</th>\n",
       "      <th>SBS7a</th>\n",
       "      <th>...</th>\n",
       "      <th>SBS51</th>\n",
       "      <th>SBS52</th>\n",
       "      <th>SBS53</th>\n",
       "      <th>SBS54</th>\n",
       "      <th>SBS55</th>\n",
       "      <th>SBS56</th>\n",
       "      <th>SBS57</th>\n",
       "      <th>SBS58</th>\n",
       "      <th>SBS59</th>\n",
       "      <th>SBS60</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AML</td>\n",
       "      <td>TCGA-AB-2802-03B-01W-0728-08</td>\n",
       "      <td>0.811</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AML</td>\n",
       "      <td>TCGA-AB-2803-03B-01W-0728-08</td>\n",
       "      <td>0.608</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 68 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Cancer Types                  Sample Names  Accuracy  SBS1  SBS2  SBS3  \\\n",
       "0          AML  TCGA-AB-2802-03B-01W-0728-08     0.811     3     0     0   \n",
       "1          AML  TCGA-AB-2803-03B-01W-0728-08     0.608     4     0     0   \n",
       "\n",
       "   SBS4  SBS5  SBS6  SBS7a  ...  SBS51  SBS52  SBS53  SBS54  SBS55  SBS56  \\\n",
       "0     0     0     0      0  ...      0      0      0      0      0      0   \n",
       "1     0     7     0      0  ...      0      0      0      0      0      0   \n",
       "\n",
       "   SBS57  SBS58  SBS59  SBS60  \n",
       "0      0      0      0      0  \n",
       "1      0      0      0      0  \n",
       "\n",
       "[2 rows x 68 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Signature activities for the TCGA WES samples\n",
    "TCGA_wes_act = pd.read_csv(\"./project_data/activities/WES/WES_TCGA.activities.csv\")\n",
    "TCGA_wes_act.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mutation type</th>\n",
       "      <th>Trinucleotide</th>\n",
       "      <th>ALL::TARGET-10-PAIXPH-03A-01D</th>\n",
       "      <th>ALL::TARGET-10-PAKHZT-03A-01R</th>\n",
       "      <th>ALL::TARGET-10-PAKMVD-09A-01D</th>\n",
       "      <th>ALL::TARGET-10-PAKSWW-03A-01D</th>\n",
       "      <th>ALL::TARGET-10-PALETF-03A-01D</th>\n",
       "      <th>ALL::TARGET-10-PALLSD-09A-01D</th>\n",
       "      <th>ALL::TARGET-10-PAMDKS-03A-01D</th>\n",
       "      <th>ALL::TARGET-10-PAPJIB-04A-01D</th>\n",
       "      <th>...</th>\n",
       "      <th>Head-SCC::V-109</th>\n",
       "      <th>Head-SCC::V-112</th>\n",
       "      <th>Head-SCC::V-116</th>\n",
       "      <th>Head-SCC::V-119</th>\n",
       "      <th>Head-SCC::V-123</th>\n",
       "      <th>Head-SCC::V-124</th>\n",
       "      <th>Head-SCC::V-125</th>\n",
       "      <th>Head-SCC::V-14</th>\n",
       "      <th>Head-SCC::V-29</th>\n",
       "      <th>Head-SCC::V-98</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACA</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C&gt;A</td>\n",
       "      <td>ACC</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 9693 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Mutation type Trinucleotide  ALL::TARGET-10-PAIXPH-03A-01D  \\\n",
       "0           C>A           ACA                              0   \n",
       "1           C>A           ACC                              0   \n",
       "\n",
       "   ALL::TARGET-10-PAKHZT-03A-01R  ALL::TARGET-10-PAKMVD-09A-01D  \\\n",
       "0                              0                              0   \n",
       "1                              0                              0   \n",
       "\n",
       "   ALL::TARGET-10-PAKSWW-03A-01D  ALL::TARGET-10-PALETF-03A-01D  \\\n",
       "0                              1                              0   \n",
       "1                              1                              0   \n",
       "\n",
       "   ALL::TARGET-10-PALLSD-09A-01D  ALL::TARGET-10-PAMDKS-03A-01D  \\\n",
       "0                              0                              0   \n",
       "1                              0                              0   \n",
       "\n",
       "   ALL::TARGET-10-PAPJIB-04A-01D  ...  Head-SCC::V-109  Head-SCC::V-112  \\\n",
       "0                              2  ...                0                0   \n",
       "1                              0  ...                1                0   \n",
       "\n",
       "   Head-SCC::V-116  Head-SCC::V-119  Head-SCC::V-123  Head-SCC::V-124  \\\n",
       "0                0                0                0                0   \n",
       "1                0                0                0                0   \n",
       "\n",
       "   Head-SCC::V-125  Head-SCC::V-14  Head-SCC::V-29  Head-SCC::V-98  \n",
       "0                0               0               0               1  \n",
       "1                0               1               0               0  \n",
       "\n",
       "[2 rows x 9693 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "other_wes_mut = pd.read_csv(\"./project_data/catalogs/WES/WES_Other.96.csv\")\n",
    "other_wes_mut.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Cancer Types</th>\n",
       "      <th>Sample Names</th>\n",
       "      <th>Accuracy</th>\n",
       "      <th>SBS1</th>\n",
       "      <th>SBS2</th>\n",
       "      <th>SBS3</th>\n",
       "      <th>SBS4</th>\n",
       "      <th>SBS5</th>\n",
       "      <th>SBS6</th>\n",
       "      <th>SBS7a</th>\n",
       "      <th>...</th>\n",
       "      <th>SBS51</th>\n",
       "      <th>SBS52</th>\n",
       "      <th>SBS53</th>\n",
       "      <th>SBS54</th>\n",
       "      <th>SBS55</th>\n",
       "      <th>SBS56</th>\n",
       "      <th>SBS57</th>\n",
       "      <th>SBS58</th>\n",
       "      <th>SBS59</th>\n",
       "      <th>SBS60</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ALL</td>\n",
       "      <td>TARGET-10-PAIXPH-03A-01D</td>\n",
       "      <td>0.529</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ALL</td>\n",
       "      <td>TARGET-10-PAKHZT-03A-01R</td>\n",
       "      <td>0.696</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 68 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Cancer Types              Sample Names  Accuracy  SBS1  SBS2  SBS3  SBS4  \\\n",
       "0          ALL  TARGET-10-PAIXPH-03A-01D     0.529     0     0     0     0   \n",
       "1          ALL  TARGET-10-PAKHZT-03A-01R     0.696     0     0     0     0   \n",
       "\n",
       "   SBS5  SBS6  SBS7a  ...  SBS51  SBS52  SBS53  SBS54  SBS55  SBS56  SBS57  \\\n",
       "0     0     0      0  ...      0      0      0      1      0      0      0   \n",
       "1     0     0      0  ...      0      0      0      1      0      0      0   \n",
       "\n",
       "   SBS58  SBS59  SBS60  \n",
       "0      0      0      0  \n",
       "1      0      0      0  \n",
       "\n",
       "[2 rows x 68 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "other_wes_act = pd.read_csv(\"./project_data/activities/WES/WES_Other.activities.csv\")\n",
    "other_wes_act.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports and helpers"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import sklearn\n",
    "from sklearn.decomposition import PCA\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "#import torch \n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.metrics import roc_curve\n",
    "from sklearn.metrics import classification_report\n",
    "\n",
    "from sklearn.model_selection import cross_val_score, train_test_split, KFold\n",
    "from sklearn.model_selection import StratifiedShuffleSplit\n",
    "from sklearn.model_selection import StratifiedKFold, GridSearchCV\n",
    "from sklearn.model_selection import learning_curve\n",
    "\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# These helpers are work in progress\n",
    "def plot_roc_auc(X_tst, y_test, model, is_multi_class=False):\n",
    "    probs = model.predict_proba(X_tst)\n",
    "    if is_multi_class:\n",
    "        # Multi-class AUC needs the full (n_samples, n_classes) probability matrix.\n",
    "        # roc_curve handles binary labels only, so just the score is reported here.\n",
    "        auc = roc_auc_score(y_test, probs, multi_class='ovo')\n",
    "        print(f'Multi-class one-vs-one AUC = {auc:.3f}')\n",
    "        return\n",
    "\n",
    "    # Binary case: keep the predicted probability of the positive class\n",
    "    probs = probs[:, 1]\n",
    "    auc = roc_auc_score(y_test, probs)\n",
    "    fp_rate, tp_rate, thresholds = roc_curve(y_test, probs)\n",
    "    \n",
    "    plt.figure(figsize=(7,6))\n",
    "    plt.axis('scaled')\n",
    "    plt.xlim([0,1])\n",
    "    plt.ylim([0,1])\n",
    "    plt.title(\"AUC & ROC\")\n",
    "    plt.plot(fp_rate, tp_rate, 'g')\n",
    "    plt.fill_between(fp_rate, tp_rate, facecolor = \"green\", alpha = 0.7)\n",
    "    plt.text(0.95, 0.05, f'AUC = {auc:.3f}', ha='right', fontsize=12, weight='bold', color='blue')\n",
    "    plt.xlabel(\"False Positive Rate\")\n",
    "    plt.ylabel(\"True Positive Rate\")\n",
    "\n",
    "def plot_confusion_mat(y_test, y_pred, labs=None, size=None):\n",
    "    cm = sklearn.metrics.confusion_matrix(y_test, y_pred)\n",
    "    if size is None:\n",
    "        plt.figure(figsize=(12,10))\n",
    "    else:\n",
    "        plt.figure(figsize=size)\n",
    "    if labs is None:\n",
    "        sns.heatmap(cm, square=False, annot=True, fmt='d', cmap='viridis', cbar=True)\n",
    "    else:\n",
    "        sns.heatmap(cm, square=False, annot=True, fmt='d', cmap='viridis', xticklabels=labs, yticklabels=labs, cbar=True)\n",
    "    plt.xlabel('Predicted label')\n",
    "    plt.ylabel('True label')\n",
    "    #plt.ylim(0, 2)\n",
    "\n",
    "def plot_learning_curve(model, X, y):\n",
    "    # Mean training and validation scores over 7 CV folds for growing training set sizes\n",
    "    N, train_lc, val_lc = learning_curve(model, X, y, cv=7, train_sizes=np.linspace(0.3, 1, 25))\n",
    "    plt.figure(figsize=(7,6))\n",
    "    plt.title(\"Learning curve\")\n",
    "    plt.plot(N, np.mean(train_lc, 1), color='blue', label='training score')\n",
    "    plt.plot(N, np.mean(val_lc, 1), color='red', label='validation score')\n",
    "    #plt.hlines(N, np.mean([train_lc[-1],  val_lc[-1]]), N[0], N[-1], color='gray', label='mean', linestyle='dashed')\n",
    "    plt.xlabel(\"Training set size\")\n",
    "    plt.ylabel(\"Score\")\n",
    "    # Without this call the curve labels defined above are never shown\n",
    "    plt.legend()\n",
    "\n",
    "def plot_trn_tst_dist(y_all, y_train, y_test, y_pred, in_cols=False):\n",
    "    #fig = None\n",
    "    #ax = None\n",
    "    if in_cols:\n",
    "        fig, ax = plt.subplots(2,2)\n",
    "    else:\n",
    "        fig, ax = plt.subplots(4,1)\n",
    "\n",
    "    fig.set_size_inches(15,8)\n",
    "\n",
    "    plt_sets = [y_all, y_train, y_test, y_pred]\n",
    "    plt_labels = [\"All\", \"Train\", \"Test\", \"Pred\"]\n",
    "    plt_set_df = pd.DataFrame()\n",
    "    for i in range(len(plt_sets)):\n",
    "        s = pd.Series(plt_sets[i]).value_counts().sort_index()\n",
    "        plt_set_df[plt_labels[i]] = s\n",
    "    \n",
    "        pd.DataFrame({plt_labels[i]: s}).plot(ax=ax.flat[i], kind=\"bar\")\n",
    "        #sns.countplot(x=s, \n",
    "        #            palette=sns.hls_palette(2),\n",
    "        #            ax=ax[i])\n",
    "        ax.flat[i].tick_params(axis=\"x\", rotation=90)\n",
    "\n",
    "    fig.tight_layout()\n",
    "    with pd.option_context('display.max_rows', None,\n",
    "                       'display.max_columns', None,\n",
    "                       'display.precision', 2,\n",
    "                       ):\n",
    "        print(plt_set_df)\n",
    "\n",
    "\n",
    "   \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset preprocessing: combine profile data into a single data frame\n",
    "\n",
    "All profile data sets are combined into a single data frame with samples in the rows and features (the 96 mutation channels) in the columns. Each row is normalized to sum to 1 so that WGS and WES samples can be used together."
   ]
  },
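  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reshaping described above can be illustrated on a toy catalog. This is only a minimal sketch under stated assumptions: the three-channel frame `toy` and the sample names `Liver::S1` and `Lung::S2` are made up for illustration and are not part of the project data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy catalog in the same layout as the *.96.csv files:\n",
    "# one row per mutation channel, one column per sample.\n",
    "toy = pd.DataFrame({\n",
    "    'Mutation type': ['C>A', 'C>A', 'C>T'],\n",
    "    'Trinucleotide': ['ACA', 'ACC', 'ACA'],\n",
    "    'Liver::S1': [1, 3, 0],\n",
    "    'Lung::S2': [2, 2, 4],\n",
    "})\n",
    "\n",
    "# Build one feature name per channel, then transpose so samples become rows.\n",
    "toy['mut_tri'] = toy['Mutation type'] + '_' + toy['Trinucleotide']\n",
    "samples = toy.set_index('mut_tri').drop(columns=['Mutation type', 'Trinucleotide']).T\n",
    "\n",
    "# Normalize each sample (row) so that its channel fractions sum to 1.\n",
    "samples = samples.div(samples.sum(axis=1), axis=0)\n",
    "\n",
    "# The tumor type label is the part of the index before '::'.\n",
    "labels = [name.split('::')[0] for name in samples.index]\n",
    "print(samples)\n",
    "print(labels)\n",
    "```\n",
    "\n",
    "The same three steps (build channel names, transpose, divide by row sums) are what `prepare_mut_df` applies to each real data set before concatenating them."
   ]
  },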
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Profile data:\n",
      "\n",
      "---Data set diagnostics print---\n",
      "\n",
      "Missing entries in mutations: 0\n",
      "The shape of the mutations data frame (23829, 97)\n",
      "Checking normalization: sum of some rows:\n",
      " Thymoma::TCGA-4V-A9QI-01A-11D-A423-09    1.0\n",
      "CNS::TCGA-06-0216-01B-01D-1492-08        1.0\n",
      "Prost-AdenoCA::SP114926                  1.0\n",
      "CNS::TCGA-06-1802-01A-01W-0643-08        1.0\n",
      "Sarcoma-bone::IC086T_WGS                 1.0\n",
      "dtype: float64\n",
      "\n",
      "\n",
      "Some tumor counts:\n",
      " Breast    1858\n",
      "Lung      1668\n",
      "CNS       1595\n",
      "Liver     1358\n",
      "Kidney    1269\n",
      "Name: tumor_types, dtype: int64\n",
      "\n",
      "\n",
      "Tumor types with smallish counts: 0\n",
      "Series([], Name: tumor_types, dtype: int64)\n",
      "\n",
      "\n",
      "Unique tumor types:  51\n",
      "['ALL', 'AML', 'Adrenal-neoplasm', 'Biliary-AdenoCA', 'Bladder-TCC', 'Blood-CMDI', 'Bone', 'Breast', 'CNS', 'CNS-NOS', 'Cervix', 'ColoRect-AdenoCA', 'ColoRect-Adenoma', 'DLBC', 'Eso-AdenoCA', 'Eso-SCC', 'Ewings', 'Eye', 'Head-SCC', 'Kidney', 'Liver', 'Lung', 'Lymph', 'Meninges-Meningioma', 'Mesothelium-Mesothelioma', 'Myeloid', 'Neuroblastoma', 'Oral-SCC', 'Ovary-AdenoCA', 'Panc', 'Para-AdenoCA', 'Para-Adenoma', 'Pheochromocytoma', 'Pit-All', 'Prost-AdenoCA', 'Prost-Adenoma', 'Sarcoma', 'Sarcoma-bone', 'Skin-BCC', 'Skin-Melanoma', 'Skin-SCC', 'Small-Intestine-carcinoid', 'SoftTissue-Leiomyo', 'SoftTissue-Liposarc', 'Stomach-AdenoCA', 'Testis-CA', 'Thy-AdenoCA', 'Thymoma', 'Transitional-cell-carcinoma', 'UCS', 'Uterus-AdenoCA']\n"
     ]
    }
   ],
   "source": [
    "\n",
    "def prepare_mut_df(raw_mutation_dfs, is_profile, small_sample_limit=None):\n",
    "\n",
    "    mutations_all = pd.DataFrame()\n",
    "\n",
    "    for df in raw_mutation_dfs:\n",
    "        # Make a copy of the original data frame and start processing from there\n",
    "        mutations = df.copy()\n",
    "    \n",
    "        if is_profile:\n",
    "            # Catalogs have one column per sample: build channel names and transpose\n",
    "            mutations['mut_tri'] = mutations.apply(lambda a: '{}_{}'.format(a['Mutation type'], a['Trinucleotide']), axis=1)\n",
    "            mutations = mutations.set_index('mut_tri').drop(['Mutation type', 'Trinucleotide'], axis=1)\n",
    "            mutations = mutations.T\n",
    "        else:\n",
    "            # Activities already have one row per sample: index them by 'type::sample'\n",
    "            mutations['mut_tri'] = mutations.apply(lambda a: '{}::{}'.format(a['Cancer Types'], a['Sample Names']), axis=1)\n",
    "            mutations = mutations.set_index('mut_tri').drop(['Cancer Types', 'Sample Names', 'Accuracy'], axis=1)\n",
    "     \n",
    "        # Collect the index labels ('<cancer type>::<sample name>') so they can be renamed\n",
    "        renamed_items = list(mutations.index)\n",
    "        index_items = list(mutations.index)\n",
    "\n",
    "        # Merge subtype labels into their parent tissue label, e.g. 'Bone-<subtype>' -> 'Bone'\n",
    "        for i in range(len(index_items)):\n",
    "            result = index_items[i]\n",
    "            for to_sub in ['Bone', 'Breast', 'Cervix', 'CNS', 'Eye', 'Liver', 'Lymph', 'Lung', 'Kidney', 'Myeloid', 'Panc' ]:\n",
    "                result = re.sub( to_sub + r'(-\\w*)', to_sub, result)\n",
    "                \n",
    "            # Harmonize the spelling of the 'CA' suffix across data sets\n",
    "            renamed_items[i] = result.replace('Ca', 'CA')\n",
    "       \n",
    "        mutations.rename(index=dict(zip(index_items, renamed_items)), inplace = True)\n",
    "   \n",
    "        # Normalize each sample (row) to sum to 1\n",
    "        row_sums = mutations.sum(axis=1)\n",
    "        mutations = mutations.divide(row_sums, axis = 0)\n",
    "\n",
    "        mutations_all = pd.concat([mutations_all, mutations])\n",
    "\n",
    "    mutations_all.sort_index(inplace=True)\n",
    "\n",
    "    # No renormalization is needed here: every row was normalized to sum to 1 before concatenation\n",
    "  \n",
    "    # Figure out tumor types based on the first part of the index\n",
    "    tumor_types = [a.split(':')[0] for a in mutations_all.index]\n",
    "    # Attach the labels to the frame\n",
    "    mutations_all[\"tumor_types\"] = tumor_types\n",
    "\n",
    "    # Get rid of types with very few samples if the limit is specified\n",
    "    if small_sample_limit is not None:\n",
    "        counts = mutations_all[\"tumor_types\"].value_counts()\n",
    "        small_counts = list(counts[counts < small_sample_limit].index)\n",
    "        mutations_all = mutations_all.loc[~mutations_all[\"tumor_types\"].isin(small_counts)]\n",
    "\n",
    "    # Sorted list of the distinct tumor types that remain after filtering\n",
    "    unique_tumor_types = sorted(mutations_all[\"tumor_types\"].unique())\n",
    "\n",
    "    return (mutations_all, unique_tumor_types)\n",
    "\n",
    "\n",
    "def print_dset_diag(mut_df, unique_tumor_types, small_sample_limit):\n",
    "    # Check if the data frame is ok\n",
    "    print(\"\\n---Data set diagnostics print---\\n\")\n",
    "    print(\"Missing entries in mutations:\", mut_df.isnull().sum().sum())\n",
    "    print(\"The shape of the mutations data frame\", mut_df.shape)\n",
    "\n",
    "    # Check to see if the rows are normalized to one, take a sample from the data frame\n",
    "    norm_df = mut_df.sample(n=5, random_state=5)\n",
    "    print(\"Checking normalization: sum of some rows:\\n\", norm_df.iloc[:,0:-1].sum(axis=1))\n",
    "    print(\"\\n\")\n",
    "\n",
    "    # Check some counts of tumor types\n",
    "    tumor_counts = mut_df[\"tumor_types\"].value_counts() #.sort_values(ascending=True)\n",
    "    print(\"Some tumor counts:\\n\", tumor_counts.head(5))\n",
    "    print(\"\\n\")\n",
    "\n",
    "    small_counts = tumor_counts < 1.5*small_sample_limit\n",
    "    print(\"Tumor types with smallish counts:\",  sum(small_counts))\n",
    "\n",
    "    print(tumor_counts[small_counts])\n",
    "    print(\"\\n\")\n",
    "\n",
    "    # Tumor types\n",
    "    print(\"Unique tumor types: \", len(unique_tumor_types))\n",
    "    print(unique_tumor_types)\n",
    "\n",
    "\n",
    "small_sample_limit = 0\n",
    "\n",
    "profile_raw_data_sets = [PCAWG_wgs_mut, TCGA_wes_mut, nonPCAWG_wgs_mut, other_wes_mut]\n",
    "profile_mut_all, prf_unique_tumor_types = prepare_mut_df(profile_raw_data_sets, True, small_sample_limit)\n",
    "\n",
    "# Print some diagnostics from the prepared data set\n",
    "print(\"Profile data:\")\n",
    "print_dset_diag(profile_mut_all, prf_unique_tumor_types, small_sample_limit)\n",
    "\n",
    "# Data matrix X for fitting, omit the tumor labeling from there, use that information in constructing true y\n",
    "# Note: this contains profile data only\n",
    "X_prf = profile_mut_all.drop(\"tumor_types\", axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset preprocessing for activities data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Activities data:\n",
      "\n",
      "---Data set diagnostics print---\n",
      "\n",
      "Missing entries in mutations: 0\n",
      "The shape of the mutations data frame (23829, 66)\n",
      "Checking normalization: sum of some rows:\n",
      " mut_tri\n",
      "Thymoma::TCGA-4V-A9QI-01A-11D-A423-09    1.0\n",
      "CNS::TCGA-06-0216-01B-01D-1492-08        1.0\n",
      "Prost-AdenoCA::SP114926                  1.0\n",
      "CNS::TCGA-06-1802-01A-01W-0643-08        1.0\n",
      "Sarcoma-bone::IC086T_WGS                 1.0\n",
      "dtype: float64\n",
      "\n",
      "\n",
      "Some tumor counts:\n",
      " Breast    1858\n",
      "Lung      1668\n",
      "CNS       1595\n",
      "Liver     1358\n",
      "Kidney    1269\n",
      "Name: tumor_types, dtype: int64\n",
      "\n",
      "\n",
      "Tumor types with smallish counts: 0\n",
      "Series([], Name: tumor_types, dtype: int64)\n",
      "\n",
      "\n",
      "Unique tumor types:  51\n",
      "['ALL', 'AML', 'Adrenal-neoplasm', 'Biliary-AdenoCA', 'Bladder-TCC', 'Blood-CMDI', 'Bone', 'Breast', 'CNS', 'CNS-NOS', 'Cervix', 'ColoRect-AdenoCA', 'ColoRect-Adenoma', 'DLBC', 'Eso-AdenoCA', 'Eso-SCC', 'Ewings', 'Eye', 'Head-SCC', 'Kidney', 'Liver', 'Lung', 'Lymph', 'Meninges-Meningioma', 'Mesothelium-Mesothelioma', 'Myeloid', 'Neuroblastoma', 'Oral-SCC', 'Ovary-AdenoCA', 'Panc', 'Para-AdenoCA', 'Para-Adenoma', 'Pheochromocytoma', 'Pit-All', 'Prost-AdenoCA', 'Prost-Adenoma', 'Sarcoma', 'Sarcoma-bone', 'Skin-BCC', 'Skin-Melanoma', 'Skin-SCC', 'Small-Intestine-carcinoid', 'SoftTissue-Leiomyo', 'SoftTissue-Liposarc', 'Stomach-AdenoCA', 'Testis-CA', 'Thy-AdenoCA', 'Thymoma', 'Transitional-cell-carcinoma', 'UCS', 'Uterus-AdenoCA']\n"
     ]
    }
   ],
   "source": [
    "act_raw_data_sets = [PCAWG_wgs_act, TCGA_wes_act, nonPCAWG_wgs_act, other_wes_act]\n",
    "act_mut_all, act_unique_tumor_types = prepare_mut_df(act_raw_data_sets, is_profile=False, small_sample_limit=small_sample_limit)\n",
    "\n",
    "# Print some diagnostics from the prepared data set\n",
    "print(\"Activities data:\")\n",
    "print_dset_diag(act_mut_all, act_unique_tumor_types, small_sample_limit)\n",
    "\n",
    "# Data matrix X for fitting, omit the tumor labeling from there, use that information in constructing true y\n",
    "# Note: this contains activities data only\n",
    "X_act = act_mut_all.drop(\"tumor_types\", axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check profile data content"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Some content from the full profile set:\n"
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>mut_tri</th>\n",
       "      <th>C&gt;A_ACA</th>\n",
       "      <th>C&gt;A_ACC</th>\n",
       "      <th>C&gt;A_ACG</th>\n",
       "      <th>C&gt;A_ACT</th>\n",
       "      <th>C&gt;A_CCA</th>\n",
       "      <th>C&gt;A_CCC</th>\n",
       "      <th>C&gt;A_CCG</th>\n",
       "      <th>C&gt;A_CCT</th>\n",
       "      <th>C&gt;A_GCA</th>\n",
       "      <th>C&gt;A_GCC</th>\n",
       "      <th>...</th>\n",
       "      <th>T&gt;G_CTT</th>\n",
       "      <th>T&gt;G_GTA</th>\n",
       "      <th>T&gt;G_GTC</th>\n",
       "      <th>T&gt;G_GTG</th>\n",
       "      <th>T&gt;G_GTT</th>\n",
       "      <th>T&gt;G_TTA</th>\n",
       "      <th>T&gt;G_TTC</th>\n",
       "      <th>T&gt;G_TTG</th>\n",
       "      <th>T&gt;G_TTT</th>\n",
       "      <th>tumor_types</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ALL::11</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.133333</td>\n",
       "      <td>0.066667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.066667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.066667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211636</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211638</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.333333</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211640</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211642</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 97 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "mut_tri       C>A_ACA  C>A_ACC  C>A_ACG  C>A_ACT  C>A_CCA  C>A_CCC  C>A_CCG  \\\n",
       "ALL::11           0.0      0.0      0.0      0.0      0.0      0.0      0.0   \n",
       "ALL::2211636      0.0      0.0      0.0      0.0      0.0      0.0      0.0   \n",
       "ALL::2211638      0.0      0.0      0.0      0.0      0.0      0.0      0.0   \n",
       "ALL::2211640      0.0      0.0      0.0      0.0      0.0      0.0      0.0   \n",
       "ALL::2211642      0.0      0.0      0.0      0.0      0.0      0.0      0.0   \n",
       "\n",
       "mut_tri        C>A_CCT   C>A_GCA  C>A_GCC  ...   T>G_CTT  T>G_GTA   T>G_GTC  \\\n",
       "ALL::11       0.133333  0.066667      0.0  ...  0.066667      0.0  0.066667   \n",
       "ALL::2211636  0.000000  0.000000      0.0  ...  0.000000      0.0  0.000000   \n",
       "ALL::2211638  0.000000  0.000000      0.0  ...  0.000000      0.0  0.000000   \n",
       "ALL::2211640  0.000000  0.000000      0.0  ...  0.000000      0.0  0.000000   \n",
       "ALL::2211642  0.000000  0.000000      0.0  ...  0.000000      0.0  0.000000   \n",
       "\n",
       "mut_tri       T>G_GTG  T>G_GTT   T>G_TTA  T>G_TTC  T>G_TTG  T>G_TTT  \\\n",
       "ALL::11           0.0      0.0  0.000000      0.0      0.0      0.0   \n",
       "ALL::2211636      0.0      0.0  0.000000      0.0      0.0      0.0   \n",
       "ALL::2211638      0.0      0.0  0.333333      0.0      0.0      0.0   \n",
       "ALL::2211640      0.0      0.0  0.000000      0.0      0.0      0.0   \n",
       "ALL::2211642      0.0      0.0  0.000000      0.0      0.0      0.0   \n",
       "\n",
       "mut_tri       tumor_types  \n",
       "ALL::11               ALL  \n",
       "ALL::2211636          ALL  \n",
       "ALL::2211638          ALL  \n",
       "ALL::2211640          ALL  \n",
       "ALL::2211642          ALL  \n",
       "\n",
       "[5 rows x 97 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"Some content from the full profile set:\")\n",
    "profile_mut_all.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 1800x360 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.figure(figsize=(25, 5))\n",
    "sns.set_theme()\n",
    "profile_mut_all[\"tumor_types\"].value_counts().sort_index().plot(kind=\"bar\")\n",
    "#sns.countplot(x=profile_mut_all[\"tumor_types\"], palette=sns.hls_palette(2))\n",
    "plt.xticks(rotation=90);\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check activities data content"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Some content from the full act set:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SBS1</th>\n",
       "      <th>SBS2</th>\n",
       "      <th>SBS3</th>\n",
       "      <th>SBS4</th>\n",
       "      <th>SBS5</th>\n",
       "      <th>SBS6</th>\n",
       "      <th>SBS7a</th>\n",
       "      <th>SBS7b</th>\n",
       "      <th>SBS7c</th>\n",
       "      <th>SBS7d</th>\n",
       "      <th>...</th>\n",
       "      <th>SBS52</th>\n",
       "      <th>SBS53</th>\n",
       "      <th>SBS54</th>\n",
       "      <th>SBS55</th>\n",
       "      <th>SBS56</th>\n",
       "      <th>SBS57</th>\n",
       "      <th>SBS58</th>\n",
       "      <th>SBS59</th>\n",
       "      <th>SBS60</th>\n",
       "      <th>tumor_types</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mut_tri</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ALL::11</th>\n",
       "      <td>0.066667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.066667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211636</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211638</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.333333</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.666667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211640</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ALL::2211642</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.250000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>ALL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 66 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  SBS1  SBS2  SBS3  SBS4      SBS5  SBS6  SBS7a  SBS7b  SBS7c  \\\n",
       "mut_tri                                                                         \n",
       "ALL::11       0.066667   0.0   0.0   0.0  0.066667   0.0    0.0    0.0    0.0   \n",
       "ALL::2211636  0.000000   0.0   0.0   0.0  0.000000   0.0    0.0    0.0    0.0   \n",
       "ALL::2211638  0.000000   0.0   0.0   0.0  0.333333   0.0    0.0    0.0    0.0   \n",
       "ALL::2211640  0.000000   0.0   0.0   0.0  0.000000   0.0    0.0    0.0    0.0   \n",
       "ALL::2211642  0.000000   0.0   0.0   0.0  0.250000   0.0    0.0    0.0    0.0   \n",
       "\n",
       "              SBS7d  ...  SBS52  SBS53  SBS54  SBS55  SBS56  SBS57  SBS58  \\\n",
       "mut_tri              ...                                                    \n",
       "ALL::11         0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0   \n",
       "ALL::2211636    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0   \n",
       "ALL::2211638    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0   \n",
       "ALL::2211640    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0   \n",
       "ALL::2211642    0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0   \n",
       "\n",
       "                 SBS59  SBS60  tumor_types  \n",
       "mut_tri                                     \n",
       "ALL::11       0.000000    0.0          ALL  \n",
       "ALL::2211636  0.000000    0.0          ALL  \n",
       "ALL::2211638  0.666667    0.0          ALL  \n",
       "ALL::2211640  0.000000    0.0          ALL  \n",
       "ALL::2211642  0.000000    0.0          ALL  \n",
       "\n",
       "[5 rows x 66 columns]"
      ]
     },
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"Some content from the full act set:\")\n",
    "act_mut_all.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 1800x360 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.figure(figsize=(25, 5))\n",
    "sns.set_theme()\n",
    "act_mut_all[\"tumor_types\"].value_counts().sort_index().plot(kind=\"bar\")\n",
    "#sns.countplot(x=profile_mut_all[\"tumor_types\"], palette=sns.hls_palette(2))\n",
    "plt.xticks(rotation=90);\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing with a single RandomForest binary classifier"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dimension of the training data (16680, 96) and test data (7149, 96)\n",
      "     All  Train  Test    Pred\n",
      "0  23342  16351  6991  7149.0\n",
      "1    487    329   158     NaN\n",
      "Accuracy: 0.9778990068541055\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.98      1.00      0.99      6991\n",
      "           1       0.00      0.00      0.00       158\n",
      "\n",
      "    accuracy                           0.98      7149\n",
      "   macro avg       0.49      0.50      0.49      7149\n",
      "weighted avg       0.96      0.98      0.97      7149\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jr/miniconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "/home/jr/miniconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "/home/jr/miniconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABCwAAAI0CAYAAADWR7hcAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8/fFQqAAAACXBIWXMAAAsTAAALEwEAmpwYAABCWklEQVR4nO39fXRd9X0n+r+lo9g8WZUlJCHArceeQkTaCbcwodPbJFN7itNERrmT/pYyapqHBrgppdeZhDtRSCIZbNorAjdNBhzaGYaGhOSmNC2uRVqR/phfp0kzado8TIjyK9TICQnCMpK9bAjB+OjcPxhUnoLtI8ln65zXa62shfdHW/qceB/5c957f/duqlQqlQAAAAAUSHOtGwAAAAB4PoEFAAAAUDgCCwAAAKBwBBYAAABA4QgsAAAAgMIRWAAAAACFI7AAAAAACqel1g0shf37H8/cXKXWbbDIOjpOy8zMY7VuAzgO3rf1qbm5KatXn1rrNk4IM0V98rsJlh/v2/p0tJmiLgOLubmK4aJO+XuF5cf7luXMTFG//L3C8uN923gsCQEAAAAKR2ABAAAAFE5dLgkBgFool49k//59OXLkcK1bWRQtLSuyenVnSiXjAgAsRL3NCNWoZq4wgQDAItm/f19OOumUnHrqGWlqaqp1OwtSqVTy+OMHs3//vpx+ek+t2wGAZa2eZoRqVDtXWBICAIvkyJHDOfXU1roYRJqamnLqqa0NfSYIABZLPc0I1ah2rhBYAMAiqqdBpJ5eCwDUWqP/u1rN6xdYAAAAAIXjHhYAsERWtZ6ck1Yu/j+1P3rySA4dfOKYvvbgwYN54xtfl/7+N2XLlvcmSW699ffzxBNP5Mor353Pf35X/uZv/jrbt1+/6H0CAC+uljPCZZe9LU899VSOHHkqDz30vfyzf7Y+SXLOOefm6qtHjvoz7rrrj/Pkk09mYODXFqXnlyKwAIAlctLKlmx+785F/767buzPoWP82i984c/zilf8bP7yL8dzxRX/R172spctej8AwPGp5Yzwn/7TJ5IkU1MP59JLfz1/+Ieffk79yJEjaWn58VHBG9/4qwtt85gJLJaxpUrliqyzc1WtWzghjufsKcBLufvuP8sVV2zJJz/5h/niF/8qv/RL/6bWLVFQjTZXNMpMkZgrgKP71V/dnL6+/vz93381Z555Vi6//Ips3fqBPP744zl8+HB+4Rf+11xxxZYkL7xS8wtf+IusWtWaBx/cnVWrTsv27deno+P0Remrcf5VqkNLlcpRe8dz9hTgx3nggftz8ODBXHDBv8zs7EzuvvvPah5YjI6OZnx8PD/4wQ+ya9eunHPOOUmSJ598Mr/zO7+TL3/5y1m5cmXOP//8bNu2LUkyOTmZoaGhHDhwIG1tbRkdHc3atWsXVOOFzBX1y1wBHItHH300//E//n6Sp/9dHh39SE455ZQcOXIk73nPlfnv//1v8vM//wsv2O8735nIJz7xmXR3n5HR0e354z/+bP73//23FqUnN90EgDp1990787rXvSFNTU157Wt/Kd/+9n3Zt2+6pj1t3Lgxd9xxR84666znbP/whz+clStXZnx8PLt27cqWLVvmayMjIxkcHMz4+HgGBwczPDy84BoA8Fyve90b5v97bm4uO3Z8NG9727/LO9/5ljz44O488MD9L7rfv/gXr0x39xlJkle84mfy8MPfX7SeBBYAUIeeeuqpfOELf5G77/6z/Oqvbs6v/dr/J0eOHMmf//lYTfu68MIL09PT85xtjz/+eO66665s2bJl/pFnp5/+9KWkMzMzmZiYSF9fX5Kkr68vExMTmZ2drboGALzQKaecPP/fn/3sHTl06GD+4A/+MJ/4xP+TV7/6X+fw4SdfdL8VK1bM/3dzcynlcnnRerIkBADq0H/7b/+//ORPrs3HP37r/Lb77vsf2b59JL/8y6+rYWcv9NBDD6WtrS033XRTvvKVr+TUU0/N
li1bcuGFF2Zqaird3d0plUpJklKplK6urkxNTaVSqVRVa29vr9lrBYDl4NChQ+noOD0rV67Mvn3T+eIX/ypvfOObTngfAgsAqEOf//yuXHzxrzxn28/8zL/I3NxcvvnNr+ecc15eo85e6MiRI3nooYdy3nnn5X3ve1+++c1v5l3vele+8IUv1Lq1dHScVusWYFE00k1GqV/L+Tienm5OS8uJWeBwrD+nVGpO0jT/9aXSP/X45jf/u3zgA+/Lb/zGr6Wrqzv/8l++Ks3NT39tc3PTc/67qemfvsfz//x8zc3Nx/X3KLAAgCXyoyePZNeN/UvyfY/mxhs/9qLb/+iPnntTxde/fnNe//rNi9JXtc4888y0tLTML9945StfmdWrV2dycjJnnnlm9u7dm3K5nFLp6ctMp6en09PTk0qlUlXteMzMPJa5ucpSvOzCWc4fBDi6ffvcdpPlrbNz1bI+jufm5nLkyNz8n5dyRnj2z3kpXV1n5O67/zJHjszlj/94V5LM79vZeUb+4A8+8YJ9jhyZyzvecfn8f7/udX153ev65vd7/p+fb25u7jl/j83NTS95ckBgAQBL5NDBJ9yZ/xi0t7fnoosuype+9KX84i/+YiYnJzMzM5Of+qmfSmtra3p7ezM2Npb+/v6MjY2lt7d3fllHtTUAqCUzwrERWAAAJ8z27dtzzz335NFHH8073vGOtLW15e67784111yTq6++OqOjo2lpacn111+f1tbWJMnWrVszNDSUHTt2pLW1NaOjo/Pfr9oaAFB8AgsA4IT54Ac/mA9+8IMv2L5mzZp88pOffNF91q9fnzvvvHNRawBA8XmsKQAsokqlfu53UE+vBQBqrdH/Xa3m9QssAGCRtLSsyOOPH6yLgaRSqeTxxw+mpWXF0b8YAHhJ9TQjVKPaucKSEABYJKtXd2b//n157LEDtW5lUbS0rMjq1Z21bgMAlr16mxGqUc1cIbAAgEVSKrXk9NOP77GZAED9MyNUx5IQAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHCOGljs378/l112WTZt2pTNmzfnyiuvzOzsbJJkcnIyAwMD2bRpUwYGBrJnz575/ZaiBgAsb6Ojo9mwYUPOPffc3H///S+o33TTTS+omSkAoDEdNbBoamrKpZdemvHx8ezatStr1qzJDTfckCQZGRnJ4OBgxsfHMzg4mOHh4fn9lqIGACxvGzduzB133JGzzjrrBbVvf/vb+cY3vpEzzzzzOdvNFADQmI4aWLS1teWiiy6a//P555+fhx9+ODMzM5mYmEhfX1+SpK+vLxMTE5mdnV2SGgCw/F144YXp6el5wfbDhw/n2muvzcjISJqamua3mykAoHG1HM8Xz83N5TOf+Uw2bNiQqampdHd3p1QqJUlKpVK6uroyNTWVSqWy6LX29vZj7rOj47TjeVlQSJ2dq2rdAiwKxzLH4qMf/WguueSSrFmz5jnbl2LeMFPQiPwuph44jhvPcQUW27ZtyymnnJK3vOUtmZiYWKqeFmxm5rHMzVVq3caS84atb/v2Hap1C7BgnZ2rHMt1qLm5aVE/yH/961/Pt771rVx11VWL9j0XS6PMFIm5ot75XcxyZ6aoT0ebKY45sBgdHc13v/vd3HLLLWlubk5PT0/27t2bcrmcUqmUcrmc6enp9PT0pFKpLHoNAKhPX/3qV/Pggw9m48aNSZJHHnkk73znO/O7v/u76e3tNVMAQIM6pseafuQjH8l9992Xm2++OStWrEiSdHR0pLe3N2NjY0mSsbGx9Pb2pr29fUlqAEB9uvzyy/PFL34x9957b+69996cccYZufXWW/OLv/iLZgoAaGBNlUrlJa9zfOCBB9LX15e1a9fm
pJNOSpKcffbZufnmm7N79+4MDQ3l4MGDaW1tzejoaNatW5ckS1I7Vo1y+WZn56psfu/OWrfBEth1Y79L3qgLLt+sTwtZErJ9+/bcc889efTRR7N69eq0tbXl7rvvfs7XbNiwIbfcckvOOeecJGaKE8VcUb/MFdQDM0V9OtpMcdTAYjlqlOHCYFG/DBbUC8NFfVrse1gUWaPMFIm5op6ZK6gHZor6dLSZ4piWhAAAAACcSAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAnzOjoaDZs2JBzzz03999/f5Jk//79ueyyy7Jp06Zs3rw5V155ZWZnZ+f3mZyczMDAQDZt2pSBgYHs2bNnwTUAoPgEFgDACbNx48bccccdOeuss+a3NTU15dJLL834+Hh27dqVNWvW5IYbbpivj4yMZHBwMOPj4xkcHMzw8PCCawBA8QksAIAT5sILL0xPT89ztrW1teWiiy6a//P555+fhx9+OEkyMzOTiYmJ9PX1JUn6+voyMTGR2dnZqmsAwPLQUusGAACeMTc3l8985jPZsGFDkmRqaird3d0plUpJklKplK6urkxNTaVSqVRVa29vr82LAwCOi8ACACiMbdu25ZRTTslb3vKWWreSJOnoOK3WLcCi6OxcVesWYMEcx41HYAEAFMLo6Gi++93v5pZbbklz89OrVnt6erJ3796Uy+WUSqWUy+VMT0+np6cnlUqlqtrxmJl5LHNzlaV4uYXjg0B927fvUK1bgAXp7FzlOK5Dzc1NL3lywD0sAICa+8hHPpL77rsvN998c1asWDG/vaOjI729vRkbG0uSjI2Npbe3N+3t7VXXAIDloalSqdTdaYNGORvS2bkqm9+7s9ZtsAR23dgvQaYuOBtSn452NuSlbN++Pffcc08effTRrF69Om1tbfm93/u99PX1Ze3atTnppJOSJGeffXZuvvnmJMnu3bszNDSUgwcPprW1NaOjo1m3bt2CaseqUWaKxFxRz8wV1AMzRX062kwhsFjGDBb1y2BBvTBc1KeFBBbLTaPMFIm5op6ZK6gHZor6ZEkIAAAAsOwILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUzlEDi9HR0WzYsCHnnntu7r///vntGzZsyOte97r09/env78/f/3Xfz1fm5yczMDAQDZt2pSBgYHs2bNnwTUAAACgcRw1sNi4cWPuuOOOnHXWWS+ofexjH8vOnTuzc+fOvPrVr57fPjIyksHBwYyPj2dwcDDDw8MLrgEAAACN46iBxYUXXpienp5j/oYzMzOZmJhIX19fkqSvry8TExOZnZ2tugYAAAA0lpaF7HzVVVelUqnkggsuyHve8560trZmamoq3d3dKZVKSZJSqZSurq5MTU2lUqlUVWtvb1/gywQAAACWk6oDizvuuCM9PT05fPhwrrvuulx77bW54YYbFrO3qnV0nFbrFmDBOjtX1boFWBSOZQAAqlF1YPHMMpEVK1ZkcHAwv/mbvzm/fe/evSmXyymVSimXy5menk5PT08qlUpVteM1M/NY5uYq1b60ZcOHgPq2b9+hWrcAC9bZucqxXIeam5ucHAAAllxVjzX94Q9/mEOHnh5AK5VKPv/5z6e3tzdJ0tHRkd7e3oyNjSVJxsbG0tvbm/b29qprAAAAQGM5amCxffv2vOY1r8kjjzySd7zjHXnDG96QmZmZ/Pqv/3o2b96cvr6+TE5OZmRkZH6frVu35lOf+lQ2bdqUT33qU7nmmmsWXAMAlr8f97j0pXgk
uselA8Dy1lSpVOpu7UQjLQnZ/N6dtW6DJbDrxn6X0VMXLAmpTwtZEvJ3f/d3Oeuss/Jrv/ZrueWWW3LOOeckSd761rfmTW96U/r7+7Nz58587nOfy+23375ktWPVKDNFYq6oZ+YK6oGZoj4dbaaoakkIAEA1Xuxx6UvxSHSPSweA5W9BjzUFAFiopXgkuselA8DyJ7AAAPgxPA2FeuHpctQDx3HjEVgAADW1FI9EX6zHpTfaPSyoX9b+s9y5h0V9cg8LAKDQluKR6B6XDgDLn6eELGPu5l2/3M2beuFsSH1ayFNCtm/fnnvuuSePPvpoVq9enba2ttx9993ZvXt3hoaGcvDgwbS2tmZ0dDTr1q1LkiWpHatGmSkSc0U9M1dQD8wU9eloM4XAYhkzWNQvgwX1wnBRnxYSWCw3jTJTJOaKemauoB6YKeqTJSEAAADAsiOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAoBD+63/9r3njG9+Y/v7+bN68Offcc0+SZHJyMgMDA9m0aVMGBgayZ8+e+X2qrQEAxSewAABqrlKp5D/8h/+Q66+/Pjt37syHP/zhvO9978vc3FxGRkYyODiY8fHxDA4OZnh4eH6/amsAQPEJLACAQmhubs6hQ4eSJIcOHUpXV1f279+fiYmJ9PX1JUn6+voyMTGR2dnZzMzMVFUDAJaHllo3AADQ1NSU3/u938sVV1yRU045JY8//nh+//d/P1NTU+nu7k6pVEqSlEqldHV1ZWpqKpVKpapae3v7MffV0XHa4r9YqIHOzlW1bgEWzHHceAQWAEDNHTlyJL//+7+fHTt25IILLsjf//3f59//+3+f66+/vqZ9zcw8lrm5Sk17OFF8EKhv+/YdqnULsCCdnascx3WoubnpJU8OCCwAgJr7zne+k+np6VxwwQVJkgsuuCAnn3xyVq5cmb1796ZcLqdUKqVcLmd6ejo9PT2pVCpV1QCA5cE9LACAmjvjjDPyyCOP5MEHH0yS7N69O48++mh+6qd+Kr29vRkbG0uSjI2Npbe3N+3t7eno6KiqBgAsD66wAABqrrOzM1u3bs2WLVvS1NSUJPnd3/3dtLW1ZevWrRkaGsqOHTvS2tqa0dHR+f2qrQEAxSewAAAK4ZJLLskll1zygu3r16/PnXfe+aL7VFsDAIrPkhAAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAArnqIHF6OhoNmzYkHPPPTf333///PbJyckMDAxk06ZNGRgYyJ49e5a0BgAAADSOowYWGzduzB133JGzzjrrOdtHRkYyODiY8fHxDA4OZnh4eElrAAAAQOM4amBx4YUXpqen5znbZmZmMjExkb6+viRJX19fJiYmMjs7uyQ1AAAAoLG0VLPT1NRUuru7UyqVkiSlUildXV2ZmppKpVJZ9Fp7e/tivFYAAABgmagqsCi6jo7Tat0CLFhn56patwCLwrEMAEA1qgosenp6snfv3pTL5ZRKpZTL5UxPT6enpyeVSmXRa8drZuaxzM1Vqnlpy4oPAfVt375DtW4BFqyzc5VjuQ41Nzc5OQAALLmqHmva0dGR3t7ejI2NJUnGxsbS29ub9vb2JakBAAAAjaWpUqm85KUI27dvzz333JNHH300q1evTltbW+6+++7s3r07Q0NDOXjwYFpbWzM6Opp169YlyZLUjkcjXWGx+b07a90GS2DXjf3OSlMXXGFRnxrpCotGmSkSc0U9M1dQD8wU9eloM8VRA4vlqFGGC4NF/TJYUC8MF/VJYFGfzBX1y1xBPTBT1KejzRRVLQkBAAAAWEoCCwAAAKBwBBYAAABA4Qgs
AAAAgMIRWAAAAACFI7AAAAAACkdgAQAUwpNPPpmRkZFcfPHF2bx5cz70oQ8lSSYnJzMwMJBNmzZlYGAge/bsmd+n2hoAUHwCCwCgED784Q9n5cqVGR8fz65du7Jly5YkycjISAYHBzM+Pp7BwcEMDw/P71NtDQAoPoEFAFBzjz/+eO66665s2bIlTU1NSZLTTz89MzMzmZiYSF9fX5Kkr68vExMTmZ2drboGACwPLbVuAADgoYceSltbW2666aZ85StfyamnnpotW7bkpJNOSnd3d0qlUpKkVCqlq6srU1NTqVQqVdXa29tr9joBgGMnsAAAau7IkSN56KGHct555+V973tfvvnNb+Zd73pXPvrRj9a0r46O02r682GxdHauqnULsGCO48YjsAAAau7MM89MS0vL/BKOV77ylVm9enVOOumk7N27N+VyOaVSKeVyOdPT0+np6UmlUqmqdjxmZh7L3FxlKV5y4fggUN/27TtU6xZgQTo7VzmO61Bzc9NLnhxwDwsAoOba29tz0UUX5Utf+lKSp5/wMTMzk7Vr16a3tzdjY2NJkrGxsfT29qa9vT0dHR1V1QCA5aGpUqnU3WmDRjkb0tm5Kpvfu7PWbbAEdt3YL0GmLjgbUp+OdjakWg899FCuvvrqHDhwIC0tLXn3u9+d1772tdm9e3eGhoZy8ODBtLa2ZnR0NOvWrUuSqmvHqlFmisRcUc/MFdQDM0V9OtpMYUkIAFAIa9asySc/+ckXbF+/fn3uvPPOF92n2hoAUHyWhAAAAACFI7AAAAAACkdgAQAAABSOwAIAAAAoHIEFAAAAUDgCCwAAAKBwBBYAAABA4QgsAAAAgMIRWAAAAACFI7AAAAAACkdgAQAAABSOwAIAAAAoHIEFAAAAUDgCCwAAAKBwBBYAAABA4QgsAAAAgMIRWAAAAACFI7AAAAAACkdgAQAAABSOwAIAAAAoHIEFAAAAUDgCCwCgUG666aace+65uf/++5Mkk5OTGRgYyKZNmzIwMJA9e/bMf221NQCg+AQWAEBhfPvb3843vvGNnHnmmfPbRkZGMjg4mPHx8QwODmZ4eHjBNQCg+AQWAEAhHD58ONdee21GRkbS1NSUJJmZmcnExET6+vqSJH19fZmYmMjs7GzVNQBgeWipdQMAAEny0Y9+NJdccknWrFkzv21qaird3d0plUpJklKplK6urkxNTaVSqVRVa29vP+aeOjpOW8RXCLXT2bmq1i3AgjmOG4/AAgCoua9//ev51re+lauuuqrWrTzHzMxjmZur1LqNE8IHgfq2b9+hWrcAC9LZucpxXIeam5te8uSAwAIAqLmvfvWrefDBB7Nx48YkySOPPJJ3vvOdef/735+9e/emXC6nVCqlXC5neno6PT09qVQqVdUAgOXBPSwAgJq7/PLL88UvfjH33ntv7r333pxxxhm59dZb8/rXvz69vb0ZGxtLkoyNjaW3tzft7e3p6OioqgYALA8LvsJiw4YNWbFiRVauXJkkueqqq/LqV786k5OTGRoayoEDB9LW1pbR0dGsXbs2SaquAQCNZ+vWrRkaGsqOHTvS2tqa0dHRBdcAgOJrqlQqC1qYuWHDhtxyyy0555xznrP9rW99a970pjelv78/O3fuzOc+97ncfvvtC6odq0ZZb9rZuSqb37uz1m2wBHbd2G+NHnXBetP6dLT1pvWkUWaKxFxRz8wV1AMzRX062kyxJEtCPIIMAAAAWIhFuenmVVddlUqlkgsuuCDvec97PIIMFoG7tVMvHMsAAFRjwYHFHXfckZ6enhw+fDjXXXddrr322rz97W9fhNaq1yiXb/oQUN9c8kY9cPlmfWqkJSEAQO0seEnIM48HW7FiRQYHB/O1r30tPT09848SS/KcR4lVWwMAAAAax4ICix/+8Ic5dOjpM2eVSiWf//zn09vbW/VjxjyCDAAAAEgWuCRkZmYmv/3bv51yuZy5ubmsX78+IyMjSTyCDAAAAKjeggKL
NWvW5K677nrR2vr163PnnXcuag0AAABoDEvyWFMAAACAhRBYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAUAj79+/PZZddlk2bNmXz5s258sorMzs7mySZnJzMwMBANm3alIGBgezZs2d+v2prAECxCSwAgEJoamrKpZdemvHx8ezatStr1qzJDTfckCQZGRnJ4OBgxsfHMzg4mOHh4fn9qq0BAMUmsAAACqGtrS0XXXTR/J/PP//8PPzww5mZmcnExET6+vqSJH19fZmYmMjs7GzVNQCg+Fpq3QAAwPPNzc3lM5/5TDZs2JCpqal0d3enVColSUqlUrq6ujI1NZVKpVJVrb29/Zj66Og4bWleIJxgnZ2rat0CLJjjuPEILACAwtm2bVtOOeWUvOUtb8nExETN+piZeSxzc5Wa/fwTyQeB+rZv36FatwAL0tm5ynFch5qbm17y5IDAAgAolNHR0Xz3u9/NLbfckubm5vT09GTv3r0pl8splUopl8uZnp5OT09PKpVKVTUAoPjcwwIAKIyPfOQjue+++3LzzTdnxYoVSZKOjo709vZmbGwsSTI2Npbe3t60t7dXXQMAis8VFgBAITzwwAO55ZZbsnbt2rz5zW9Okpx99tm5+eabs3Xr1gwNDWXHjh1pbW3N6Ojo/H7V1gCAYhNYAACF8NM//dP5h3/4hxetrV+/Pnfeeeei1gCAYrMkBAAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHBaat0AQCNZ1XpyTlrZWL96OztX1bqFE+JHTx7JoYNP1LoNABpIo80VjTJTJOaKZxTy6J6cnMzQ0FAOHDiQtra2jI6OZu3atbVuC2DBTlrZks3v3VnrNlgCu27sz6FaN8ELmCmAemauqF/miqcVcknIyMhIBgcHMz4+nsHBwQwPD9e6JQBgGTJTAMDyVbjAYmZmJhMTE+nr60uS9PX1ZWJiIrOzszXuDABYTswUALC8FW5JyNTUVLq7u1MqlZIkpVIpXV1dmZqaSnt7+zF9j+bmpqVssVC6Vp9c6xZYIo10HDca79v61Sjv2+XyOs0Ux8/vp/rVaMdyI/G+rV+N8L492mssXGCxGFavPrXWLZwwt37w4lq3wBLp6Dit1i2wRLxv65f3bf1ppJki8fupnvn9VL+8b+uX920Bl4T09PRk7969KZfLSZJyuZzp6en09PTUuDMAYDkxUwDA8la4wKKjoyO9vb0ZGxtLkoyNjaW3t/eYL90EAEjMFACw3DVVKpVKrZt4vt27d2doaCgHDx5Ma2trRkdHs27dulq3BQAsM2YKAFi+ChlYAAAAAI2tcEtCAAAAAAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUTkutG4AfZ//+/XnkkUeSJGeccUZWr15d444AgOXKXAGw/AgsKJzvfe97+dCHPpSJiYl0dXUlSaanp3Peeeflmmuuydq1a2vbIACwbJgrAJavpkqlUql1E/Bsb37zmzM4OJi+vr40Nz+9amlubi67du3Kpz/96Xz2s5+tcYfA8dq8eXN27dpV6zaABmSugPpipmgsrrCgcA4cOJBLLrnkOduam5vT39+fj3/84zXqCjiaf/zHf/yxtf3795/ATgD+ibkClh8zBc8QWFA4bW1tGRsbyxve8IY0NTUlSSqVSnbt2pXW1tYadwf8OH19fTnrrLPyYhfuHThw4MQ3BBBzBSxHZgqeYUkIhbNnz56MjIzkO9/5Trq7u5Mke/fuzctf/vJs3bo169at
q3GHwIvZuHFjPv3pT8+/b5/tta99bf7qr/6qBl0Bjc5cAcuPmYJnuMKCwlm7dm0+8YlPZHZ2NlNTU0mSnp6etLe317gz4KVcfPHF+cEPfvCiw8Uv//Iv16AjAHMFLEdmCp7hCgsAAACgcJpr3QAAAADA8wksAAAAgMIRWAAAAACFI7AAAAAACkdgAQAAABSOwAIAAAAoHIEFAAAAUDgCCwAAAKBwBBYAAABA4QgsAAAAgMIRWAAAAACF01LrBpbC/v2PZ26uUus2WGQdHadlZuaxWrcBHAfv2/rU3NyU1atPrXUbJ4SZApYX/+7A8nK0maIuA4u5uYrhok75e4Xlx/uW5cxMAcuP9yzUD0tCAAAAgMIRWAAAAACFI7AAAAAACqcu72EBAEuhXD6S/fv35ciRw7Vu5YRoaVmR1as7UyoZFwBgMTTaLPFs1cwVJhAAOEb79+/LSSedklNPPSNNTU21bmdJVSqVPP74wezfvy+nn95T63YAoC400izxbNXOFQsKLL7//e/nt37rt+b/fOjQoTz22GP527/920xOTmZoaCgHDhxIW1tbRkdHs3bt2iSpugYAtXTkyOGGGTCamppy6qmteeyxA7VuBQDqRiPNEs9W7VyxoHtYnH322dm5c+f8/zZu3Ji+vr4kycjISAYHBzM+Pp7BwcEMDw/P71dtDQBqrZEGjEZ6rQBwojTqv6/VvO5FWxJy+PDh7Nq1K7feemtmZmYyMTGR2267LUnS19eXbdu2ZXZ2NpVKpapae3v7YrUKAItiVevJOWnl4q+u/NGTR3Lo4BOL/n0BgGIxS7y0Rft/5t577013d3de8YpX5L777kt3d3dKpVKSpFQqpaurK1NTU6lUKlXVBBYAFM1JK1uy+b07F/377rqxP4eO8jWXXfa2PPXUUzly5Kk89ND38s/+2fokyTnnnJurrx45pp/zta/9XY4cOZJXvernF9gxAFCNWs4SSfKrv7o5K1asyMtetiJzc+W87W3vzL/5N5uq/rmf//yu/M3f/HW2b7++6u/xbIsWWHzuc5/Lm970psX6dgvS0XFarVs4IQ4/Vc6Kl5Vq3cYJ1dm5qtYtnBCN+HdL/aqn9+30dHNaWk7ME8GP9nNuu+2TSZKHH34473jHW/KpT/0/x/0zvvnNr+WJJ57IL/zCL/zYr2lubq6rv0NIlu6MJsXgd1Z9qpcrBopo+/bRrFv3z3P//f//vOtd78yFF16Utra2JMmRI0fS0lK735eL8pP37t2br371q7n++qdTlJ6enuzduzflcjmlUinlcjnT09Pp6elJpVKpqnY8ZmYey9xcZTFeWqF1dq5akjSO2tt1Y3/27TuWTBSKrbNzVV0dy3NzczlyZO6E/Kxj/Tnl8lySSo4cmcuXv/zF3H77f8mTTx7Oy172svz2b78nP/MzP5vvfW9PrrvumvzoRz/K3Fw5v/Irm3PRRf8qf/qnn8vc3Fz+9m+/ko0bL86v//rbX/D95+bmXvB32Nzc1DAnB6hPS3VGE1g6x3rFANU755yX55RTTsl1143kzDPPykMPPZQDB/bnv/yXT+XP/3wsf/Ind6ZcLue0007LVVcN5Sd/cm2eeuqpfOQj1+frX//7dHZ25Sd/cu2i9rQogcWf/umf5rWvfW1Wr16dJOno6Ehvb2/GxsbS39+fsbGx9Pb2zi/rqLYGALy4H/zg+/nDP7w1//f//R9z6qmn5cEHd+eqq/6P/Mmf3J0/+ZM/zr/6V/9r3v72S5MkBw8eTGtra/r7/22eeOKJXHnlu2vb/P/k6WMAUDtf+9rf5fDhw2lpacl9930rN930Bzn55JPzzW9+Pffe+4XcfPN/yooVK/LlL38pv/u71+bjH/8v2bnzc5maejif/OQf5ciRI/mt37rsuC84eCmLFlh84AMfeM62rVu3ZmhoKDt27Ehra2tGR0cXXAMAXtxXvvLl/OAH389v/dbl89vK5XJmZ2dy/vn/S26++aN56qmn8nM/d2F+
7ucurGGnP94zTx97xnXXXZdyuZzkn54i1t/fn507d2Z4eDi33377gmoAQPLBD74vK1aszKmnnprrrhvNPff8Rc4772dz8sknJ0m+9KX/ln/8xwdy+eVvT5JUKpUcOnQwSfK1r/19fuVX+tLS0pKWlpZs2vQr+R//4xuL1tuiBBbj4+Mv2LZ+/frceeedL/r11dYAgBdXqVRy0UX/Kh/60LUvqP3rf70xP/Mz/yJ/+7f/PZ/61B/m7rv/LMPD22rQ5bHz9DEAODGeuYfFM+655y9yyiknz/+5Ukne8IZLcuml73rBvpXK0t6Kwd2GAKAOvOpVP5/bbvtPefDB3Vm37uknhnznO99Ob+8r8v3vP5Qzzzwrr3/95px99pr8zu88HWqceuqpefTRfbVs+8cqytPH3KsDoDjq4YaqRbqB9zNKpef21NTUlObmpvltr3nNa3LNNcP5t//2Tenq6k65XM4DD/xDXv7y8/KqV70q99zz+Vx88aYcOXIkf/mX4+nuPuPH/uzjvZm3wAIAqvSjJ49k1439S/J9j9eaNT+Z4eFt+b/+r2158sknc+TIU/nZn31lentfkXvv/ULuuecv8rKXtaSpqSlbtrw3SfKa1/xSPvCB/zNvf/vgj73pZq0U5eljjXIj70ZTDx96oBHVw428n38D76WcJY7nBt7P/tpKpZK5ucr8tp/92f8ll19+Ra666t3/82ufyi/90r/JP//nL09f3/+W++9/IP/u3/1qurq688pX/lympn7wY3/282/mfbQbeQssAKBKhw4+UfM7lvf0nJm77/7/Jnn6KotXvernX/A1b33rb+Stb/2NF2w/88yzctttn17yHo9X0Z4+BgBLpdazxB//8a4XbPvAB7a+YNvFF/9KLr74V16w/WUve1ne974PvGD7Yjkx16IAAByjl3r6WJLnPEWs2hoAUHyusAAACsXTxwCARGABAMelUqmkqamp1m2cEEt95+8fx9PHAKhnjTRLPFs1c4UlIQBwjJqbSymXj/+GmMtVuXwkzc2lWrcBAHWj0WaJZ6tmrhBYAMAxOvnk03Lo0IFUKsd21+3lrFKZy6FD+3PyyR7rCQCLpZFmiWerdq6wJAQAjtFpp/1E9u/fl717v5+k3h912ZQVK07Kaaf9RK0bAYC60VizxLNVN1cILADgGDU1NaW9vavWbQAAy5RZ4vhYEgIAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApnwYHFk08+mZGRkVx88cXZvHlzPvShDyVJJicnMzAwkE2bNmVgYCB79uyZ36faGgAAANAYFhxYfPjDH87KlSszPj6eXbt2ZcuWLUmSkZGRDA4OZnx8PIODgxkeHp7fp9oaAAAA0BgWFFg8/vjjueuuu7Jly5Y0NTUlSU4//fTMzMxkYmIifX19SZK+vr5MTExkdna26hoAAADQOFoWsvNDDz2Utra23HTTTfnKV76SU089NVu2bMlJJ52U7u7ulEqlJEmpVEpXV1empqZSqVSqqrW3tx9zXx0dpy3kZUEhdHauqnULsCgcywAAVGNBgcWRI0fy0EMP5bzzzsv73ve+fPOb38y73vWufPSjH12s/qoyM/NY5uYqNe3hRPAhoL7t23eo1i3AgnV2rnIs16Hm5qYlOTnw5JNP5nd+53fy5S9/OStXrsz555+fbdu2ZXJyMkNDQzlw4EDa2toyOjqatWvXJknVNQCg+Ba0JOTMM89MS0vL/BKOV77ylVm9enVOOumk7N27N+VyOUlSLpczPT2dnp6e9PT0VFUDAOqb+2IBAM+2oMCivb09F110Ub70pS8lefpMxszMTNauXZve3t6MjY0lScbGxtLb25v29vZ0dHRUVQMA6pf7YgEAz7egJSFJcs011+Tq
q6/O6OhoWlpacv3116e1tTVbt27N0NBQduzYkdbW1oyOjs7vU20NAKhPRb0vFgBQOwsOLNasWZNPfvKTL9i+fv363HnnnS+6T7U1AKA+FfW+WG7kDVAc7uHXeBYcWAAALNSx3BerVCo95/5WlUqlqtrxaJQbeTcaH3pgeXIj7/pztBt5L+geFgAAi8F9sQCA52uqVCp1d9qgUc6GdHauyub37qx1GyyBXTf2S5CpCx5rWp+W6rGmDz30UK6++uocOHAgLS0tefe7353Xvva12b17d4aGhnLw4MH5+1utW7cuSaquHatGmSkajRkKlh/zcX062kxhSQgAUAjuiwUAPJslIQAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFE7LQr/Bhg0bsmLFiqxcuTJJctVVV+XVr351JicnMzQ0lAMHDqStrS2jo6NZu3ZtklRdAwAAABrDolxh8bGPfSw7d+7Mzp078+pXvzpJMjIyksHBwYyPj2dwcDDDw8PzX19tDQAAAGgMS7IkZGZmJhMTE+nr60uS9PX1ZWJiIrOzs1XXAAAAgMax4CUhydPLQCqVSi644IK85z3vydTUVLq7u1MqlZIkpVIpXV1dmZqaSqVSqarW3t6+GK0CAAAAy8CCA4s77rgjPT09OXz4cK677rpce+21efvb374IrVWvo+O0mv58WAydnatq3QIsCscyx8O9sQCAZyw4sOjp6UmSrFixIoODg/nN3/zNvP/978/evXtTLpdTKpVSLpczPT2dnp6eVCqVqmrHY2bmsczNVRb60grPh4D6tm/foVq3AAvW2bnKsVyHmpublvTkwMc+9rGcc845z9n2zD2u+vv7s3PnzgwPD+f2229fUA0AKLYF3cPihz/8YQ4denoQrVQq+fznP5/e3t50dHSkt7c3Y2NjSZKxsbH09vamvb296hoA0JjcGwsAGtOCrrCYmZnJb//2b6dcLmdubi7r16/PyMhIkmTr1q0ZGhrKjh070tramtHR0fn9qq0BAPWvSPfGsswUoDhcYd54FhRYrFmzJnfdddeL1tavX58777xzUWsAQH0r2r2xGmWZaaPxoQeWJ8tM68/RlpkuyWNNAQCq8fx7Y33ta19LT0/P/D2ukjznHlfV1gCA4hNYAACF4N5YAMCzLfgpIQAAi8G9sQCAZxNYAACF4N5YAMCzWRICAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhLFpgcdNNN+Xcc8/N/fffnySZnJzMwMBANm3alIGBgezZs2f+a6utAQAAAI1hUQKLb3/72/nGN76RM888c37byMhIBgcHMz4+nsHBwQwPDy+4BgAAADSGBQcWhw8fzrXXXpuRkZE0NTUlSWZmZjIxMZG+vr4kSV9fXyYmJjI7O1t1DQAAAGgcCw4sPvrRj+aSSy7JmjVr5rdNTU2lu7s7pVIpSVIqldLV1ZWpqamqawBAY7DMFABIkpaF7Pz1r3893/rWt3LVVVctVj+LoqPjtFq3AAvW2bmq1i3AonAsczxeaplpf39/du7cmeHh4dx+++0LqgEAxbegwOKrX/1qHnzwwWzcuDFJ8sgjj+Sd73xn3v/+92fv3r0pl8splUopl8uZnp5OT09PKpVKVbXjMTPzWObmKgt5acuCDwH1bd++Q7VuARass3OVY7kONTc3LcnJ
gWeWmd5www1529veluSflpnedtttSZ5eLrpt27bMzs6mUqlUVWtvb1/03gGAxbegJSGXX355vvjFL+bee+/NvffemzPOOCO33nprXv/616e3tzdjY2NJkrGxsfT29qa9vT0dHR1V1QCA+maZKQDwbAu6wuKlbN26NUNDQ9mxY0daW1szOjq64BoAUJ8sMwXgaFxh3ngWNbC499575/97/fr1ufPOO1/066qtAQD1yTJTTiQfemB5ssy0/hxtmemCnxICALBQlpkCAM+3ZEtCAAAWg2WmANCYBBYAQOFYZgoAWBICAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIUjsAAAAAAKR2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHAEFgAAAEDhCCwAAACAwhFYAAAAAIXTstBvcMUVV+T73/9+mpubc8opp+RDH/pQent7Mzk5maGhoRw4cCBtbW0ZHR3N2rVrk6TqGgAAANAYFnyFxejoaP7sz/4sd911V37jN34jV199dZJkZGQkg4ODGR8fz+DgYIaHh+f3qbYGAAAANIYFBxarVq2a/+/HHnssTU1NmZmZycTERPr6+pIkfX19mZiYyOzsbNU1AKD+XXHFFbnkkkvyxje+MYODg/nOd76T5OkrMAcGBrJp06YMDAxkz5498/tUWwMAim3BS0KS5AMf+EC+9KUvpVKp5D//5/+cqampdHd3p1QqJUlKpVK6uroyNTWVSqVSVa29vf2Y++noOG0xXhbUVGfnqqN/ESwDjmWOx+jo6PzJkL/8y7/M1VdfnT/90z+dvwKzv78/O3fuzPDwcG6//fYkqboGABTbogQW1113XZLkrrvuyvXXX58tW7Ysxret2szMY5mbq9S0hxPBh4D6tm/foVq3AAvW2bnKsVyHmpubluzkwEtduXnbbbclefoKzG3btmV2djaVSqWq2vGcCAEAamNRAotnvPGNb8zw8HDOOOOM7N27N+VyOaVSKeVyOdPT0+np6UmlUqmqBgA0hqJduQkA1MaCAovHH388Bw8enA8U7r333vzET/xEOjo60tvbm7GxsfT392dsbCy9vb3zw0G1NQCg/hXpyk3LTAGKwxXmjWdBgcUTTzyRLVu25Iknnkhzc3N+4id+IrfcckuampqydevWDA0NZceOHWltbc3o6Oj8ftXWAIDGUYQrNxtlmWmj8aEHlifLTOvP0ZaZLiiwOP300/NHf/RHL1pbv3597rzzzkWtAQD1y5WbAMCzLeo9LAAAquXKTQDg2QQWAEAhuHITAHi25lo3AAAAAPB8AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUzoICi/379+eyyy7Lpk2bsnnz5lx55ZWZnZ1NkkxOTmZgYCCbNm3KwMBA9uzZM79ftTUAAACgMSwosGhqasqll16a8fHx7Nq1K2vWrMkNN9yQJBkZGcng4GDGx8czODiY4eHh+f2qrQEAAACNYUGBRVtbWy666KL5P59//vl5+OGHMzMzk4mJifT19SVJ+vr6MjExkdnZ2aprAEB9c+UmAPBsi3YPi7m5uXzmM5/Jhg0bMjU1le7u7pRKpSRJqVRKV1dXpqamqq4BAPXNlZsAwLO1LNY32rZtW0455ZS85S1vycTExGJ926p0dJxW058Pi6Gzc1WtW4BF4VjmWL3YlZuf+cxn5q/AvO2225I8fQXmtm3bMjs7m0qlUlWtvb39
xL9AAOC4LEpgMTo6mu9+97u55ZZb0tzcnJ6enuzduzflcjmlUinlcjnT09Pp6elJpVKpqnY8ZmYey9xcZTFeWqH5EFDf9u07VOsWYME6O1c5lutQc3PTkp8cONYrNyuVSlW1Yw0snAQBKA6ffxrPggOLj3zkI7nvvvvyB3/wB1mxYkWSpKOjI729vRkbG0t/f3/GxsbS29s7PxxUWwMAGkNRrtxslJMgjcaHHlienASpP0c7CbKgwOKBBx7ILbfckrVr1+bNb35zkuTss8/OzTffnK1bt2ZoaCg7duxIa2trRkdH5/ertgYA1L+iXbkJANTGggKLn/7pn84//MM/vGht/fr1ufPOOxe1BgDUN1duAgDPaKpUKnV3nWOjXL7Z2bkqm9+7s9ZtsAR23djvkjfqgntY1KeluofFAw88kL6+vqxduzYnnXRSkn+6cnP37t0ZGhrKwYMH56/AXLduXZJUXTsWjTJTNBozFCw/5uP6tKRLQgAAFosrNwGAZ2uudQMAAAAAzyewAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEsKLAYHR3Nhg0bcu655+b++++f3z45OZmBgYFs2rQpAwMD2bNnz4JrAAAAQONYUGCxcePG3HHHHTnrrLOes31kZCSDg4MZHx/P4OBghoeHF1wDAOqbEyEAwLMtKLC48MIL09PT85xtMzMzmZiYSF9fX5Kkr68vExMTmZ2drboGANQ/J0IAgGdb9HtYTE1Npbu7O6VSKUlSKpXS1dWVqampqmsAQP1zIgQAeLaWWjewFDo6Tqt1C7BgnZ2rat0CLArHMgvxUic0KpVKVbX29vaavR4A4NgtemDR09OTvXv3plwup1QqpVwuZ3p6Oj09PalUKlXVjtfMzGOZm6ss9ksrHB8C6tu+fYdq3QIsWGfnKsdyHWpubmqYkwON8joBlgOffxrPogcWHR0d6e3tzdjYWPr7+zM2Npbe3t75sxnV1gCAxlPrEyGNchKk0fjQA8uTkyD152gnQRZ0D4vt27fnNa95TR555JG84x3vyBve8IYkydatW/OpT30qmzZtyqc+9alcc8018/tUWwMAGs+zT4Qkec4JjWprAMDy0FSpVOrutEGjnA3p7FyVze/dWes2WAK7buyXIFMXLAmpT0u1JGT79u2555578uijj2b16tVpa2vL3Xffnd27d2doaCgHDx5Ma2trRkdHs27duiSpunasGmWmaDRmKFh+zMf16WgzhcBiGfOPbf3yC5l6IbCoT410D4tGmSkajRkKlh/zcX1a0iUhAAAAAEtBYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCEVgAAAAAhSOwAAAAAApHYAEAAAAUjsACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOEILAAAAIDCaal1AwCNZFXryTlpZWP96u3sXFXrFk6IHz15JIcOPlHrNgAA6kZjTc0ANXbSypZsfu/OWrfBEth1Y38O1boJAIA6UsglIZOTkxkYGMimTZsyMDCQPXv21LolAGAZMlMAwPJVyMBiZGQkg4ODGR8fz+DgYIaHh2vdEgCwDJkpAGD5KtySkJmZmUxMTOS2225LkvT19WXbtm2ZnZ1Ne3v7MX2P5uampWyxULpWn1zrFlgijXQcNxrv2/rVKO/b5fI6zRS8FL+LYfnxO7n+HO3vtHCBxdTUVLq7u1MqlZIkpVIpXV1dmZqaOubhYvXqU5ey
xUK59YMX17oFlkhHx2m1boEl4n1bv7xvi8VMwUvxuxiWH//ONp5CLgkBAAAAGlvhAouenp7s3bs35XI5SVIulzM9PZ2enp4adwYALCdmCgBY3goXWHR0dKS3tzdjY2NJkrGxsfT29h7zpZsAAImZAgCWu6ZKpVKpdRPPt3v37gwNDeXgwYNpbW3N6Oho1q1bV+u2AIBlxkwBAMtXIQMLAAAAoLEVbkkIAAAAgMACAAAAKByBBQAAAFA4AgsAAACgcAQWAAAAQOG01LoB+HH279+fRx55JElyxhlnZPXq1TXuCACAojEzQv0SWFA43/ve9/KhD30oExMT6erqSpJMT0/nvPPOyzXXXJO1a9fWtkEAAGrOzAj1r6lSqVRq3QQ825vf/OYMDg6mr68vzc1Pr1qam5vLrl278ulPfzqf/exna9whcLw2b96cXbt21boNAOqImRHqnyssKJwDBw7kkksuec625ubm9Pf35+Mf/3iNugKO5h//8R9/bG3//v0nsBMAGoGZEeqfwILCaWtry9jYWN7whjekqakpSVKpVLJr1660trbWuDvgx+nr68tZZ52VF7tw78CBAye+IQDqmpkR6p8lIRTOnj17MjIyku985zvp7u5Okuzduzcvf/nLs3Xr1qxbt67GHQIvZuPGjfn0pz89/759tte+9rX5q7/6qxp0BUC9MjNC/XOFBYWzdu3afOITn8js7GympqaSJD09PWlvb69xZ8BLufjii/ODH/zgRQOLX/7lX65BRwDUMzMj1D9XWAAAAACF01zrBgAAAACeT2ABAAAAFI7AAgAAACgcgQUAAABQOAILAAAAoHD+XzWJPmwmk9o2AAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 1080x576 with 4 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 864x720 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 504x432 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# For binary classification, construct the y vector for a single selected tumor type\n",
    "\n",
    "target_type = \"Biliary-AdenoCA\"\n",
    "\n",
    "y_prf = profile_mut_all[\"tumor_types\"].values\n",
    "\n",
    "# Encode the labels as two classes: 1 if the sample is of the target type, otherwise 0\n",
    "y_prf_bin = [1 if tumor_type == target_type else 0 for tumor_type in y_prf]\n",
    "\n",
    "# Split the data for fitting and prediction; a simple random (non-stratified) split is used here\n",
    "X_prf_train, X_prf_test, y_prf_train, y_prf_test = train_test_split(X_prf, y_prf_bin, test_size=0.3, random_state=898)\n",
    "\n",
    "print(f\"Dimension of the training data {X_prf_train.shape} and test data {X_prf_test.shape}\")\n",
    "\n",
    "# Make a model\n",
    "model_rfs = RandomForestClassifier()\n",
    "\n",
    "# Fit the model\n",
    "clf = model_rfs.fit(X_prf_train, y_prf_train)\n",
    "\n",
    "# Predict with unused (test) data\n",
    "y_prf_pred = model_rfs.predict(X_prf_test)\n",
    "\n",
    "# Inspect the class distributions and report the metrics\n",
    "plot_trn_tst_dist(y_prf_bin, y_prf_train, y_prf_test, y_prf_pred, in_cols=True)\n",
    "print(\"Accuracy:\", accuracy_score(y_prf_test, y_prf_pred))\n",
    "print(classification_report(y_prf_test, y_prf_pred))\n",
    "\n",
    "# Plot the confusion matrix and the ROC curve\n",
    "plot_confusion_mat(y_prf_test, y_prf_pred, labs=[\"0\", \"1\"])\n",
    "plot_roc_auc(X_prf_test, y_prf_test, model_rfs)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We notice that the training data is heavily skewed, which results in a classifier that never predicts class 1. This is not good. Let's try oversampling and see how it affects the situation. You need to install imbalanced-learn for this to work."
   ]
  },