{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project Tasks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first few assignments, we learned how to infer parts-based components (known as mutational signatures) generated by particular mutational processes using Non-negative Matrix Factorization (NMF). In doing so, we reconstruct the mutation catalog of a given sample from mutational signatures and their contributions.\n", "\n", "In this group project, you will use similar mutational profiles and signature activities to predict cancer types, but with a much larger sample size.\n", "You should:\n", "* Separate the data into training and test groups within each cancer type.\n", "* Find out which features are informative for the prediction of the cancer type (label). You should try the profiles and activities combined, as well as each data type on its own.\n", "* Implement different models for classification of the samples given the input data, and evaluate model performance on the test data to detect overfitting. Briefly explain how each model that you have used works.\n", "* Report model performance using standard machine learning metrics, such as confusion matrices.\n", "* Compare model performance across methods and across cancer types: are some types easier to predict than others?\n", "* Submit a single Jupyter notebook as the final report and present it during the last assignment session." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data include both mutational catalogs from multiple cancers and the predicted activities from the paper [\"Alexandrov LB, et al. (2020) The repertoire of mutational signatures in human cancer\"](https://www.nature.com/articles/s41586-020-1943-3). The data are generated either from whole human genomes (WGS) or from exome regions only (WES). 
Since the exome region constitutes only about 1% of the human genome, the total number of mutations in the WES samples is, of course, much smaller. So if you plan to use WGS together with WES data, remember to normalize the profile of each sample so that it sums to 1.\n", "\n", "Note that the data are generated on different platforms by different research groups. Some of them (e.g. those labeled PCAWG or TCGA) are processed with the same bioinformatics pipeline, so these samples will have less variability related to data processing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Several cancer types may be labeled under the same tissue, e.g. 'Bone-Benign' and 'Bone-Epith'. You can either combine such types or keep only the one that has more samples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a link to background reading: [\"Pan-Cancer Analysis of Whole Genomes\"](https://www.nature.com/collections/afdejfafdb). Have a look especially at the paper [\"A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns\"](https://www.nature.com/articles/s41467-019-13825-8)." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mutational catalogs and activities - WGS data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Mutation type</th>\n", " <th>Trinucleotide</th>\n", " <th>Biliary-AdenoCA::SP117655</th>\n", " <th>Biliary-AdenoCA::SP117556</th>\n", " <th>Biliary-AdenoCA::SP117627</th>\n", " <th>Biliary-AdenoCA::SP117775</th>\n", " <th>Biliary-AdenoCA::SP117332</th>\n", " <th>Biliary-AdenoCA::SP117712</th>\n", " <th>Biliary-AdenoCA::SP117017</th>\n", " <th>Biliary-AdenoCA::SP117031</th>\n", " <th>...</th>\n", " <th>Uterus-AdenoCA::SP94540</th>\n", " <th>Uterus-AdenoCA::SP95222</th>\n", " <th>Uterus-AdenoCA::SP89389</th>\n", " <th>Uterus-AdenoCA::SP90503</th>\n", " <th>Uterus-AdenoCA::SP92460</th>\n", " <th>Uterus-AdenoCA::SP92931</th>\n", " <th>Uterus-AdenoCA::SP91265</th>\n", " <th>Uterus-AdenoCA::SP89909</th>\n", " <th>Uterus-AdenoCA::SP90629</th>\n", " <th>Uterus-AdenoCA::SP95550</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>C>A</td>\n", " <td>ACA</td>\n", " <td>269</td>\n", " <td>114</td>\n", " <td>105</td>\n", " <td>217</td>\n", " <td>52</td>\n", " <td>192</td>\n", " <td>54</td>\n", " <td>196</td>\n", " <td>...</td>\n", " <td>117</td>\n", " <td>233</td>\n", " <td>94</td>\n", " <td>114</td>\n", " <td>257</td>\n", " <td>139</td>\n", " <td>404</td>\n", " <td>97</td>\n", " <td>250</td>\n", " <td>170</td>\n", " </tr>\n", 
" <tr>\n", " <th>1</th>\n", " <td>C>A</td>\n", " <td>ACC</td>\n", " <td>148</td>\n", " <td>56</td>\n", " <td>71</td>\n", " <td>123</td>\n", " <td>36</td>\n", " <td>139</td>\n", " <td>54</td>\n", " <td>102</td>\n", " <td>...</td>\n", " <td>90</td>\n", " <td>167</td>\n", " <td>59</td>\n", " <td>64</td>\n", " <td>268</td>\n", " <td>75</td>\n", " <td>255</td>\n", " <td>78</td>\n", " <td>188</td>\n", " <td>137</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 2782 columns</p>\n", "</div>" ], "text/plain": [ " Mutation type Trinucleotide Biliary-AdenoCA::SP117655 \\\n", "0 C>A ACA 269 \n", "1 C>A ACC 148 \n", "\n", " Biliary-AdenoCA::SP117556 Biliary-AdenoCA::SP117627 \\\n", "0 114 105 \n", "1 56 71 \n", "\n", " Biliary-AdenoCA::SP117775 Biliary-AdenoCA::SP117332 \\\n", "0 217 52 \n", "1 123 36 \n", "\n", " Biliary-AdenoCA::SP117712 Biliary-AdenoCA::SP117017 \\\n", "0 192 54 \n", "1 139 54 \n", "\n", " Biliary-AdenoCA::SP117031 ... Uterus-AdenoCA::SP94540 \\\n", "0 196 ... 117 \n", "1 102 ... 
90 \n", "\n", " Uterus-AdenoCA::SP95222 Uterus-AdenoCA::SP89389 Uterus-AdenoCA::SP90503 \\\n", "0 233 94 114 \n", "1 167 59 64 \n", "\n", " Uterus-AdenoCA::SP92460 Uterus-AdenoCA::SP92931 Uterus-AdenoCA::SP91265 \\\n", "0 257 139 404 \n", "1 268 75 255 \n", "\n", " Uterus-AdenoCA::SP89909 Uterus-AdenoCA::SP90629 Uterus-AdenoCA::SP95550 \n", "0 97 250 170 \n", "1 78 188 137 \n", "\n", "[2 rows x 2782 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## PCAWG data are all processed with the same pipeline\n", "PCAWG_wgs_mut = pd.read_csv(\"./project_data/catalogs/WGS/WGS_PCAWG.96.csv\")\n", "PCAWG_wgs_mut.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accuracy is the cosine similarity between the reconstructed catalog and the observed catalog." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Cancer Types</th>\n", " <th>Sample Names</th>\n", " <th>Accuracy</th>\n", " <th>SBS1</th>\n", " <th>SBS2</th>\n", " <th>SBS3</th>\n", " <th>SBS4</th>\n", " <th>SBS5</th>\n", " <th>SBS6</th>\n", " <th>SBS7a</th>\n", " <th>...</th>\n", " <th>SBS51</th>\n", " <th>SBS52</th>\n", " <th>SBS53</th>\n", " <th>SBS54</th>\n", " <th>SBS55</th>\n", " <th>SBS56</th>\n", " <th>SBS57</th>\n", " <th>SBS58</th>\n", " <th>SBS59</th>\n", " <th>SBS60</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Biliary-AdenoCA</td>\n", " <td>SP117655</td>\n", " <td>0.968</td>\n", " <td>1496</td>\n", " <td>1296</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1825</td>\n", 
<td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Biliary-AdenoCA</td>\n", " <td>SP117556</td>\n", " <td>0.963</td>\n", " <td>985</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>922</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 68 columns</p>\n", "</div>" ], "text/plain": [ " Cancer Types Sample Names Accuracy SBS1 SBS2 SBS3 SBS4 SBS5 SBS6 \\\n", "0 Biliary-AdenoCA SP117655 0.968 1496 1296 0 0 1825 0 \n", "1 Biliary-AdenoCA SP117556 0.963 985 0 0 0 922 0 \n", "\n", " SBS7a ... SBS51 SBS52 SBS53 SBS54 SBS55 SBS56 SBS57 SBS58 SBS59 \\\n", "0 0 ... 0 0 0 0 0 0 0 0 0 \n", "1 0 ... 
0 0 0 0 0 0 0 0 0 \n", "\n", " SBS60 \n", "0 0 \n", "1 0 \n", "\n", "[2 rows x 68 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Activities:\n", "PCAWG_wgs_act = pd.read_csv (\"./project_data/activities/WGS/WGS_PCAWG.activities.csv\")\n", "PCAWG_wgs_act.head(2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Mutation type</th>\n", " <th>Trinucleotide</th>\n", " <th>ALL::PD4020a</th>\n", " <th>ALL::SJBALL011_D</th>\n", " <th>ALL::SJBALL012_D</th>\n", " <th>ALL::SJBALL020013_D1</th>\n", " <th>ALL::SJBALL020422_D1</th>\n", " <th>ALL::SJBALL020579_D1</th>\n", " <th>ALL::SJBALL020589_D1</th>\n", " <th>ALL::SJBALL020625_D1</th>\n", " <th>...</th>\n", " <th>Stomach-AdenoCa::pfg316T</th>\n", " <th>Stomach-AdenoCa::pfg317T</th>\n", " <th>Stomach-AdenoCa::pfg344T</th>\n", " <th>Stomach-AdenoCa::pfg373T</th>\n", " <th>Stomach-AdenoCa::pfg375T</th>\n", " <th>Stomach-AdenoCa::pfg378T</th>\n", " <th>Stomach-AdenoCa::pfg398T</th>\n", " <th>Stomach-AdenoCa::pfg413T</th>\n", " <th>Stomach-AdenoCa::pfg416T</th>\n", " <th>Stomach-AdenoCa::pfg424T</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>C>A</td>\n", " <td>ACA</td>\n", " <td>35</td>\n", " <td>9</td>\n", " <td>2</td>\n", " <td>7</td>\n", " <td>5</td>\n", " <td>7</td>\n", " <td>3</td>\n", " <td>5</td>\n", " <td>...</td>\n", " <td>133</td>\n", " <td>185</td>\n", " <td>202</td>\n", " <td>185</td>\n", " <td>96</td>\n", " <td>134</td>\n", " <td>12</td>\n", " <td>279</td>\n", " <td>75</td>\n", " 
<td>135</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>C>A</td>\n", " <td>ACC</td>\n", " <td>16</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>10</td>\n", " <td>5</td>\n", " <td>9</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>...</td>\n", " <td>48</td>\n", " <td>70</td>\n", " <td>126</td>\n", " <td>88</td>\n", " <td>35</td>\n", " <td>54</td>\n", " <td>16</td>\n", " <td>112</td>\n", " <td>31</td>\n", " <td>91</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 1867 columns</p>\n", "</div>" ], "text/plain": [ " Mutation type Trinucleotide ALL::PD4020a ALL::SJBALL011_D \\\n", "0 C>A ACA 35 9 \n", "1 C>A ACC 16 2 \n", "\n", " ALL::SJBALL012_D ALL::SJBALL020013_D1 ALL::SJBALL020422_D1 \\\n", "0 2 7 5 \n", "1 4 10 5 \n", "\n", " ALL::SJBALL020579_D1 ALL::SJBALL020589_D1 ALL::SJBALL020625_D1 ... \\\n", "0 7 3 5 ... \n", "1 9 1 2 ... \n", "\n", " Stomach-AdenoCa::pfg316T Stomach-AdenoCa::pfg317T \\\n", "0 133 185 \n", "1 48 70 \n", "\n", " Stomach-AdenoCa::pfg344T Stomach-AdenoCa::pfg373T \\\n", "0 202 185 \n", "1 126 88 \n", "\n", " Stomach-AdenoCa::pfg375T Stomach-AdenoCa::pfg378T \\\n", "0 96 134 \n", "1 35 54 \n", "\n", " Stomach-AdenoCa::pfg398T Stomach-AdenoCa::pfg413T \\\n", "0 12 279 \n", "1 16 112 \n", "\n", " Stomach-AdenoCa::pfg416T Stomach-AdenoCa::pfg424T \n", "0 75 135 \n", "1 31 91 \n", "\n", "[2 rows x 1867 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nonPCAWG_wgs_mut = pd.read_csv (\"./project_data/catalogs/WGS/WGS_Other.96.csv\")\n", "nonPCAWG_wgs_mut.head(2)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " 
<thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Cancer Types</th>\n", " <th>Sample Names</th>\n", " <th>Accuracy</th>\n", " <th>SBS1</th>\n", " <th>SBS2</th>\n", " <th>SBS3</th>\n", " <th>SBS4</th>\n", " <th>SBS5</th>\n", " <th>SBS6</th>\n", " <th>SBS7a</th>\n", " <th>...</th>\n", " <th>SBS51</th>\n", " <th>SBS52</th>\n", " <th>SBS53</th>\n", " <th>SBS54</th>\n", " <th>SBS55</th>\n", " <th>SBS56</th>\n", " <th>SBS57</th>\n", " <th>SBS58</th>\n", " <th>SBS59</th>\n", " <th>SBS60</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ALL</td>\n", " <td>PD4020a</td>\n", " <td>0.995</td>\n", " <td>208</td>\n", " <td>3006</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>365</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ALL</td>\n", " <td>SJBALL011_D</td>\n", " <td>0.905</td>\n", " <td>66</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>144</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 68 columns</p>\n", "</div>" ], "text/plain": [ " Cancer Types Sample Names Accuracy SBS1 SBS2 SBS3 SBS4 SBS5 SBS6 \\\n", "0 ALL PD4020a 0.995 208 3006 0 0 365 0 \n", "1 ALL SJBALL011_D 0.905 66 0 0 0 144 0 \n", "\n", " SBS7a ... SBS51 SBS52 SBS53 SBS54 SBS55 SBS56 SBS57 SBS58 SBS59 \\\n", "0 0 ... 0 0 0 0 0 0 0 0 0 \n", "1 0 ... 
0 0 0 0 0 0 0 0 0 \n", "\n", " SBS60 \n", "0 0 \n", "1 0 \n", "\n", "[2 rows x 68 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nonPCAWG_wgs_act = pd.read_csv (\"./project_data/activities/WGS/WGS_Other.activities.csv\")\n", "nonPCAWG_wgs_act.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mutational catalogs - WES data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Mutation type</th>\n", " <th>Trinucleotide</th>\n", " <th>AML::TCGA-AB-2802-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2803-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2804-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2805-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2806-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2807-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2808-03B-01W-0728-08</th>\n", " <th>AML::TCGA-AB-2809-03D-01W-0755-09</th>\n", " <th>...</th>\n", " <th>Eye-Melanoma::TCGA-WC-A885-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-WC-A888-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-WC-A88A-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-WC-AA9A-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-WC-AA9E-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-YZ-A980-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-YZ-A982-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-YZ-A983-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-YZ-A984-01A-11D-A39W-08</th>\n", " <th>Eye-Melanoma::TCGA-YZ-A985-01A-11D-A39W-08</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " 
<th>0</th>\n", " <td>C>A</td>\n", " <td>ACA</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>4</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>C>A</td>\n", " <td>ACC</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 9495 columns</p>\n", "</div>" ], "text/plain": [ " Mutation type Trinucleotide AML::TCGA-AB-2802-03B-01W-0728-08 \\\n", "0 C>A ACA 0 \n", "1 C>A ACC 0 \n", "\n", " AML::TCGA-AB-2803-03B-01W-0728-08 AML::TCGA-AB-2804-03B-01W-0728-08 \\\n", "0 0 0 \n", "1 2 0 \n", "\n", " AML::TCGA-AB-2805-03B-01W-0728-08 AML::TCGA-AB-2806-03B-01W-0728-08 \\\n", "0 0 4 \n", "1 0 0 \n", "\n", " AML::TCGA-AB-2807-03B-01W-0728-08 AML::TCGA-AB-2808-03B-01W-0728-08 \\\n", "0 0 2 \n", "1 1 3 \n", "\n", " AML::TCGA-AB-2809-03D-01W-0755-09 ... \\\n", "0 0 ... \n", "1 0 ... 
\n", "\n", " Eye-Melanoma::TCGA-WC-A885-01A-11D-A39W-08 \\\n", "0 1 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-WC-A888-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-WC-A88A-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-WC-AA9A-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-WC-AA9E-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-YZ-A980-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-YZ-A982-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-YZ-A983-01A-11D-A39W-08 \\\n", "0 0 \n", "1 1 \n", "\n", " Eye-Melanoma::TCGA-YZ-A984-01A-11D-A39W-08 \\\n", "0 0 \n", "1 0 \n", "\n", " Eye-Melanoma::TCGA-YZ-A985-01A-11D-A39W-08 \n", "0 0 \n", "1 0 \n", "\n", "[2 rows x 9495 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Performed by TCGA pipeline\n", "TCGA_wes_mut = pd.read_csv (\"./project_data/catalogs/WES/WES_TCGA.96.csv\")\n", "TCGA_wes_mut.head(2)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Cancer Types</th>\n", " <th>Sample Names</th>\n", " <th>Accuracy</th>\n", " <th>SBS1</th>\n", " <th>SBS2</th>\n", " <th>SBS3</th>\n", " <th>SBS4</th>\n", " <th>SBS5</th>\n", " <th>SBS6</th>\n", " <th>SBS7a</th>\n", " <th>...</th>\n", " <th>SBS51</th>\n", " <th>SBS52</th>\n", " <th>SBS53</th>\n", " <th>SBS54</th>\n", " <th>SBS55</th>\n", " <th>SBS56</th>\n", " <th>SBS57</th>\n", " <th>SBS58</th>\n", " <th>SBS59</th>\n", " <th>SBS60</th>\n", " </tr>\n", " 
</thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AML</td>\n", " <td>TCGA-AB-2802-03B-01W-0728-08</td>\n", " <td>0.811</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>AML</td>\n", " <td>TCGA-AB-2803-03B-01W-0728-08</td>\n", " <td>0.608</td>\n", " <td>4</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 68 columns</p>\n", "</div>" ], "text/plain": [ " Cancer Types Sample Names Accuracy SBS1 SBS2 SBS3 \\\n", "0 AML TCGA-AB-2802-03B-01W-0728-08 0.811 3 0 0 \n", "1 AML TCGA-AB-2803-03B-01W-0728-08 0.608 4 0 0 \n", "\n", " SBS4 SBS5 SBS6 SBS7a ... SBS51 SBS52 SBS53 SBS54 SBS55 SBS56 \\\n", "0 0 0 0 0 ... 0 0 0 0 0 0 \n", "1 0 7 0 0 ... 
0 0 0 0 0 0 \n", "\n", " SBS57 SBS58 SBS59 SBS60 \n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "\n", "[2 rows x 68 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##Activities\n", "TCGA_wes_act = pd.read_csv(\"./project_data/activities/WES/WES_TCGA.activities.csv\")\n", "TCGA_wes_act.head(2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Mutation type</th>\n", " <th>Trinucleotide</th>\n", " <th>ALL::TARGET-10-PAIXPH-03A-01D</th>\n", " <th>ALL::TARGET-10-PAKHZT-03A-01R</th>\n", " <th>ALL::TARGET-10-PAKMVD-09A-01D</th>\n", " <th>ALL::TARGET-10-PAKSWW-03A-01D</th>\n", " <th>ALL::TARGET-10-PALETF-03A-01D</th>\n", " <th>ALL::TARGET-10-PALLSD-09A-01D</th>\n", " <th>ALL::TARGET-10-PAMDKS-03A-01D</th>\n", " <th>ALL::TARGET-10-PAPJIB-04A-01D</th>\n", " <th>...</th>\n", " <th>Head-SCC::V-109</th>\n", " <th>Head-SCC::V-112</th>\n", " <th>Head-SCC::V-116</th>\n", " <th>Head-SCC::V-119</th>\n", " <th>Head-SCC::V-123</th>\n", " <th>Head-SCC::V-124</th>\n", " <th>Head-SCC::V-125</th>\n", " <th>Head-SCC::V-14</th>\n", " <th>Head-SCC::V-29</th>\n", " <th>Head-SCC::V-98</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>C>A</td>\n", " <td>ACA</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " 
<td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>C>A</td>\n", " <td>ACC</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 9693 columns</p>\n", "</div>" ], "text/plain": [ " Mutation type Trinucleotide ALL::TARGET-10-PAIXPH-03A-01D \\\n", "0 C>A ACA 0 \n", "1 C>A ACC 0 \n", "\n", " ALL::TARGET-10-PAKHZT-03A-01R ALL::TARGET-10-PAKMVD-09A-01D \\\n", "0 0 0 \n", "1 0 0 \n", "\n", " ALL::TARGET-10-PAKSWW-03A-01D ALL::TARGET-10-PALETF-03A-01D \\\n", "0 1 0 \n", "1 1 0 \n", "\n", " ALL::TARGET-10-PALLSD-09A-01D ALL::TARGET-10-PAMDKS-03A-01D \\\n", "0 0 0 \n", "1 0 0 \n", "\n", " ALL::TARGET-10-PAPJIB-04A-01D ... Head-SCC::V-109 Head-SCC::V-112 \\\n", "0 2 ... 0 0 \n", "1 0 ... 
1 0 \n", "\n", " Head-SCC::V-116 Head-SCC::V-119 Head-SCC::V-123 Head-SCC::V-124 \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "\n", " Head-SCC::V-125 Head-SCC::V-14 Head-SCC::V-29 Head-SCC::V-98 \n", "0 0 0 0 1 \n", "1 0 1 0 0 \n", "\n", "[2 rows x 9693 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "other_wes_mut = pd.read_csv(\"./project_data/catalogs/WES/WES_Other.96.csv\")\n", "other_wes_mut.head(2)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Cancer Types</th>\n", " <th>Sample Names</th>\n", " <th>Accuracy</th>\n", " <th>SBS1</th>\n", " <th>SBS2</th>\n", " <th>SBS3</th>\n", " <th>SBS4</th>\n", " <th>SBS5</th>\n", " <th>SBS6</th>\n", " <th>SBS7a</th>\n", " <th>...</th>\n", " <th>SBS51</th>\n", " <th>SBS52</th>\n", " <th>SBS53</th>\n", " <th>SBS54</th>\n", " <th>SBS55</th>\n", " <th>SBS56</th>\n", " <th>SBS57</th>\n", " <th>SBS58</th>\n", " <th>SBS59</th>\n", " <th>SBS60</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ALL</td>\n", " <td>TARGET-10-PAIXPH-03A-01D</td>\n", " <td>0.529</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ALL</td>\n", " <td>TARGET-10-PAKHZT-03A-01R</td>\n", " <td>0.696</td>\n", " <td>0</td>\n", " 
<td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>2 rows × 68 columns</p>\n", "</div>" ], "text/plain": [ " Cancer Types Sample Names Accuracy SBS1 SBS2 SBS3 SBS4 \\\n", "0 ALL TARGET-10-PAIXPH-03A-01D 0.529 0 0 0 0 \n", "1 ALL TARGET-10-PAKHZT-03A-01R 0.696 0 0 0 0 \n", "\n", " SBS5 SBS6 SBS7a ... SBS51 SBS52 SBS53 SBS54 SBS55 SBS56 SBS57 \\\n", "0 0 0 0 ... 0 0 0 1 0 0 0 \n", "1 0 0 0 ... 0 0 0 1 0 0 0 \n", "\n", " SBS58 SBS59 SBS60 \n", "0 0 0 0 \n", "1 0 0 0 \n", "\n", "[2 rows x 68 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "other_wes_act = pd.read_csv(\"./project_data/activities/WES/WES_Other.activities.csv\")\n", "other_wes_act.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }