Created
October 5, 2020 18:40
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Liu et al, ABBS, 2009 - Table 3\n", | |
"\n", | |
"This notebook augments the data in Table 3 of (Liu, 2009), which lists genes upregulated in HEK293 cells upon depletion of REST. Specifically, we add Ensembl, UniProt and Entrez Gene identifiers to the existing data in the table, fetched from the [mygene.info](https://mygene.info) API (Xin, 2016).\n", | |
"\n", | |
"Where a UniGene identifier mapped to several genes, the data were manually curated to keep only the mapping whose name matches the name in the publication. Where one UniGene ID mapped to multiple Ensembl or UniProt IDs, these were concatenated with a semicolon (`;`).\n", | |
"\n", | |
"References:\n", | |
"\n", | |
"* (Liu, 2009): Liu, Z., Liu, M., Niu, G., Cheng, Y., and Fei, J. (2009). Genome-wide identification of target genes repressed by the zinc finger transcription factor REST/NRSF in the HEK 293 cell line. Acta Bioch Bioph Sin 41, 1008–1017.\n", | |
"* (Xin, 2016): Xin, J., Mark, A., Afrasiabi, C., Tsueng, G., Juchler, M., Gopal, N., Stupp, G., Putman, T., Ainscough, B., Griffith, O., Torkamani, A., Whetzel, P., Mungall, C., Mooney, S., Su, A., and Wu, C. (2016). High-performance web services for querying gene and variant annotation. Genome Biology 17(1), 91. https://dx.doi.org/10.1186/s13059-016-0953-9" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%config InlineBackend.figure_format = 'retina'\n", | |
"%matplotlib inline\n", | |
"import os\n", | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"import seaborn as sns\n", | |
"from matplotlib import pyplot as plt\n", | |
"from tqdm import tqdm\n", | |
"sns.set_palette('Dark2')\n", | |
"sns.set_style({'axes.axisbelow': True, 'axes.edgecolor': '.15', 'axes.facecolor': 'white',\n", | |
" 'axes.grid': True, 'axes.labelcolor': '.15', 'axes.linewidth': 1.25, \n", | |
" 'figure.facecolor': 'white', 'font.family': ['sans-serif'], 'grid.color': '.15',\n", | |
" 'grid.linestyle': ':', 'grid.alpha': .5, 'image.cmap': 'Greys', \n", | |
" 'legend.frameon': False, 'legend.numpoints': 1, 'legend.scatterpoints': 1,\n", | |
" 'lines.solid_capstyle': 'round', 'axes.spines.right': False, 'axes.spines.top': False, \n", | |
" 'text.color': '.15', 'xtick.top': False, 'ytick.right': False, 'xtick.color': '.15',\n", | |
" 'xtick.direction': 'out', 'xtick.major.size': 6, 'xtick.minor.size': 3,\n", | |
" 'ytick.color': '.15', 'ytick.direction': 'out', 'ytick.major.size': 6,'ytick.minor.size': 3})\n", | |
"sns.set_context('talk')\n", | |
"\n", | |
"#http://phyletica.org/matplotlib-fonts/\n", | |
"import matplotlib\n", | |
"matplotlib.rcParams['pdf.fonttype'] = 42\n", | |
"matplotlib.rcParams['ps.fonttype'] = 42" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"data = pd.read_csv('liu-et-al-abbs-2009-table-3.tsv', sep='\\t')\n", | |
"data = data.set_index('Unigene')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Protein (Gene)</th>\n", | |
" <th>CHR</th>\n", | |
" <th>Ratio</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Unigene</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Hs.334370</th>\n", | |
" <td>Brain expressed, X-linked 1 (BEX1)</td>\n", | |
" <td>Xq22</td>\n", | |
" <td>5.55</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.690634</th>\n", | |
" <td>Heat shock 70 kDa protein 1-like (HSPA1L)</td>\n", | |
" <td>NaN</td>\n", | |
" <td>5.43</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.631926</th>\n", | |
" <td>Cadherin, EGF LAG seven-pass G-type receptor 3...</td>\n", | |
" <td>3p24.1-p21.2</td>\n", | |
" <td>4.92</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.465506</th>\n", | |
" <td>Phosphatidic acid phosphatase type 2C (PPAP2C)</td>\n", | |
" <td>19p13</td>\n", | |
" <td>4.14</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.432648</th>\n", | |
" <td>heat shock 70 kDa protein 2 (HSPA2)</td>\n", | |
" <td>14q24.1</td>\n", | |
" <td>3.97</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Protein (Gene) CHR \\\n", | |
"Unigene \n", | |
"Hs.334370 Brain expressed, X-linked 1 (BEX1) Xq22 \n", | |
"Hs.690634 Heat shock 70 kDa protein 1-like (HSPA1L) NaN \n", | |
"Hs.631926 Cadherin, EGF LAG seven-pass G-type receptor 3... 3p24.1-p21.2 \n", | |
"Hs.465506 Phosphatidic acid phosphatase type 2C (PPAP2C) 19p13 \n", | |
"Hs.432648 heat shock 70 kDa protein 2 (HSPA2) 14q24.1 \n", | |
"\n", | |
" Ratio \n", | |
"Unigene \n", | |
"Hs.334370 5.55 \n", | |
"Hs.690634 5.43 \n", | |
"Hs.631926 4.92 \n", | |
"Hs.465506 4.14 \n", | |
"Hs.432648 3.97 " | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Making sure there are 54 genes:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert len(data) == 54" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import mygene\n", | |
"mg = mygene.MyGeneInfo()\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 54/54 [00:21<00:00, 2.53it/s]" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"One-to-many hits:\n", | |
"['Hs.23581', 'Hs.274402', 'Hs.464391']\n", | |
"Data was downloaded from MyGene at 2020-10-05T20:29:34.374934\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"from datetime import datetime\n", | |
"\n", | |
"\n", | |
"mapping = []\n", | |
"\n", | |
"multihits = []\n", | |
"\n", | |
"for query in tqdm(data.index):\n", | |
"    # For some reason querymany fails, so each UniGene ID is queried separately.\n", | |
"    response = mg.query(f'unigene:{query}',\n", | |
"                        fields=['ensembl.gene', 'entrezgene', 'symbol', 'uniprot', 'unigene'])\n", | |
"\n", | |
"    hits = response['hits']\n", | |
"    if len(hits) > 1:\n", | |
"        multihits.append(query)\n", | |
"\n", | |
"    for hit in hits:\n", | |
"        # Ensembl may be missing, a single record, or a list of records.\n", | |
"        if 'ensembl' not in hit:\n", | |
"            ensembl = None\n", | |
"        elif isinstance(hit['ensembl'], list):\n", | |
"            ensembl = ';'.join([e['gene'] for e in hit['ensembl']])\n", | |
"        else:\n", | |
"            ensembl = hit['ensembl']['gene']\n", | |
"\n", | |
"        # Only Swiss-Prot accessions are kept; a hit may carry TrEMBL entries only.\n", | |
"        if 'uniprot' not in hit or 'Swiss-Prot' not in hit['uniprot']:\n", | |
"            uniprot = None\n", | |
"        elif isinstance(hit['uniprot']['Swiss-Prot'], list):\n", | |
"            uniprot = ';'.join(hit['uniprot']['Swiss-Prot'])\n", | |
"        else:\n", | |
"            uniprot = hit['uniprot']['Swiss-Prot']\n", | |
"\n", | |
"        mapping.append([query,\n", | |
"                        hit['symbol'], hit['entrezgene'],\n", | |
"                        ensembl,\n", | |
"                        uniprot])\n", | |
"\n", | |
"mapping = pd.DataFrame(mapping, columns=['Unigene', 'symbol', 'entrez_gene', 'ensembl_gene', 'uniprot'])\n", | |
"mapping = mapping.set_index(['Unigene', 'symbol'])\n", | |
" \n", | |
"print(\"One-to-many hits:\")\n", | |
"print(multihits)\n", | |
"\n", | |
"print('Data was downloaded from MyGene at {}'.format(datetime.now().isoformat()))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'took': 3,\n", | |
" 'total': 2,\n", | |
" 'max_score': 18.433449,\n", | |
" 'hits': [{'_id': '54741',\n", | |
" '_score': 18.433449,\n", | |
" 'ensembl': {'gene': 'ENSG00000213625'},\n", | |
" 'entrezgene': '54741',\n", | |
" 'symbol': 'LEPROT',\n", | |
" 'unigene': ['Hs.23581', 'Hs.258228'],\n", | |
" 'uniprot': {'Swiss-Prot': 'O15243',\n", | |
" 'TrEMBL': ['A0A087X0N2', 'A0A3B3ISI6', 'A0A3B3ISI8', 'A0A3B3ITV1']}},\n", | |
" {'_id': '3953',\n", | |
" '_score': 17.640072,\n", | |
" 'ensembl': {'gene': 'ENSG00000116678'},\n", | |
" 'entrezgene': '3953',\n", | |
" 'symbol': 'LEPR',\n", | |
" 'unigene': ['Hs.23581', 'Hs.723178'],\n", | |
" 'uniprot': {'Swiss-Prot': 'P48357', 'TrEMBL': 'Q4G138'}}]}" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"mg.query('unigene:Hs.23581', fields=['ensembl.gene', 'entrezgene', 'symbol', 'uniprot', 'unigene'])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Some manual curation of one-to-many hits based on gene names:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th>entrez_gene</th>\n", | |
" <th>ensembl_gene</th>\n", | |
" <th>uniprot</th>\n", | |
" <th>Protein (Gene)</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Unigene</th>\n", | |
" <th>symbol</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th rowspan=\"2\" valign=\"top\">Hs.23581</th>\n", | |
" <th>LEPROT</th>\n", | |
" <td>54741</td>\n", | |
" <td>ENSG00000213625</td>\n", | |
" <td>O15243</td>\n", | |
" <td>leptin receptor (LEPR)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>LEPR</th>\n", | |
" <td>3953</td>\n", | |
" <td>ENSG00000116678</td>\n", | |
" <td>P48357</td>\n", | |
" <td>leptin receptor (LEPR)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th rowspan=\"2\" valign=\"top\">Hs.274402</th>\n", | |
" <th>HSPA1A</th>\n", | |
" <td>3303</td>\n", | |
" <td>ENSG00000237724;ENSG00000204389;ENSG0000021532...</td>\n", | |
" <td>P0DMV8;P0DMV9</td>\n", | |
" <td>heat shock 70 kDa protein 1B (HSPA1B)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>HSPA1B</th>\n", | |
" <td>3304</td>\n", | |
" <td>ENSG00000232804;ENSG00000204388;ENSG0000023155...</td>\n", | |
" <td>P0DMV8;P0DMV9</td>\n", | |
" <td>heat shock 70 kDa protein 1B (HSPA1B)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th rowspan=\"2\" valign=\"top\">Hs.464391</th>\n", | |
" <th>TBCD</th>\n", | |
" <td>6904</td>\n", | |
" <td>ENSG00000278759;ENSG00000141556</td>\n", | |
" <td>Q9BTW9</td>\n", | |
" <td>tubulin-specific chaperone d (TBCD)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>ZNF750</th>\n", | |
" <td>79755</td>\n", | |
" <td>ENSG00000141579</td>\n", | |
" <td>Q32MQ0</td>\n", | |
" <td>tubulin-specific chaperone d (TBCD)</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" entrez_gene \\\n", | |
"Unigene symbol \n", | |
"Hs.23581 LEPROT 54741 \n", | |
" LEPR 3953 \n", | |
"Hs.274402 HSPA1A 3303 \n", | |
" HSPA1B 3304 \n", | |
"Hs.464391 TBCD 6904 \n", | |
" ZNF750 79755 \n", | |
"\n", | |
" ensembl_gene \\\n", | |
"Unigene symbol \n", | |
"Hs.23581 LEPROT ENSG00000213625 \n", | |
" LEPR ENSG00000116678 \n", | |
"Hs.274402 HSPA1A ENSG00000237724;ENSG00000204389;ENSG0000021532... \n", | |
" HSPA1B ENSG00000232804;ENSG00000204388;ENSG0000023155... \n", | |
"Hs.464391 TBCD ENSG00000278759;ENSG00000141556 \n", | |
" ZNF750 ENSG00000141579 \n", | |
"\n", | |
" uniprot Protein (Gene) \n", | |
"Unigene symbol \n", | |
"Hs.23581 LEPROT O15243 leptin receptor (LEPR) \n", | |
" LEPR P48357 leptin receptor (LEPR) \n", | |
"Hs.274402 HSPA1A P0DMV8;P0DMV9 heat shock 70 kDa protein 1B (HSPA1B) \n", | |
" HSPA1B P0DMV8;P0DMV9 heat shock 70 kDa protein 1B (HSPA1B) \n", | |
"Hs.464391 TBCD Q9BTW9 tubulin-specific chaperone d (TBCD) \n", | |
" ZNF750 Q32MQ0 tubulin-specific chaperone d (TBCD) " | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"mapping.loc[multihits].join(data['Protein (Gene)'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"delete_hits = {\n", | |
" ('Hs.23581', 'LEPROT'),\n", | |
" ('Hs.274402', 'HSPA1A'),\n", | |
" ('Hs.464391', 'ZNF750')\n", | |
"}\n", | |
"\n", | |
"mapping = mapping.drop(delete_hits)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th>entrez_gene</th>\n", | |
" <th>ensembl_gene</th>\n", | |
" <th>uniprot</th>\n", | |
" <th>Protein (Gene)</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Unigene</th>\n", | |
" <th>symbol</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Hs.23581</th>\n", | |
" <th>LEPR</th>\n", | |
" <td>3953</td>\n", | |
" <td>ENSG00000116678</td>\n", | |
" <td>P48357</td>\n", | |
" <td>leptin receptor (LEPR)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.274402</th>\n", | |
" <th>HSPA1B</th>\n", | |
" <td>3304</td>\n", | |
" <td>ENSG00000232804;ENSG00000204388;ENSG0000023155...</td>\n", | |
" <td>P0DMV8;P0DMV9</td>\n", | |
" <td>heat shock 70 kDa protein 1B (HSPA1B)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.464391</th>\n", | |
" <th>TBCD</th>\n", | |
" <td>6904</td>\n", | |
" <td>ENSG00000278759;ENSG00000141556</td>\n", | |
" <td>Q9BTW9</td>\n", | |
" <td>tubulin-specific chaperone d (TBCD)</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" entrez_gene \\\n", | |
"Unigene symbol \n", | |
"Hs.23581 LEPR 3953 \n", | |
"Hs.274402 HSPA1B 3304 \n", | |
"Hs.464391 TBCD 6904 \n", | |
"\n", | |
" ensembl_gene \\\n", | |
"Unigene symbol \n", | |
"Hs.23581 LEPR ENSG00000116678 \n", | |
"Hs.274402 HSPA1B ENSG00000232804;ENSG00000204388;ENSG0000023155... \n", | |
"Hs.464391 TBCD ENSG00000278759;ENSG00000141556 \n", | |
"\n", | |
" uniprot Protein (Gene) \n", | |
"Unigene symbol \n", | |
"Hs.23581 LEPR P48357 leptin receptor (LEPR) \n", | |
"Hs.274402 HSPA1B P0DMV8;P0DMV9 heat shock 70 kDa protein 1B (HSPA1B) \n", | |
"Hs.464391 TBCD Q9BTW9 tubulin-specific chaperone d (TBCD) " | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"mapping.loc[multihits].join(data['Protein (Gene)'])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Create augmented DF:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"augmented_df = pd.merge(data.reset_index(),\n", | |
"                        mapping.reset_index(),\n", | |
"                        on='Unigene',\n", | |
"                        how='left').set_index('Unigene')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"scrolled": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Protein (Gene)</th>\n", | |
" <th>CHR</th>\n", | |
" <th>Ratio</th>\n", | |
" <th>symbol</th>\n", | |
" <th>entrez_gene</th>\n", | |
" <th>ensembl_gene</th>\n", | |
" <th>uniprot</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Unigene</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Hs.334370</th>\n", | |
" <td>Brain expressed, X-linked 1 (BEX1)</td>\n", | |
" <td>Xq22</td>\n", | |
" <td>5.55</td>\n", | |
" <td>BEX1</td>\n", | |
" <td>55859</td>\n", | |
" <td>ENSG00000133169</td>\n", | |
" <td>Q9HBH7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.690634</th>\n", | |
" <td>Heat shock 70 kDa protein 1-like (HSPA1L)</td>\n", | |
" <td>NaN</td>\n", | |
" <td>5.43</td>\n", | |
" <td>HSPA1L</td>\n", | |
" <td>3305</td>\n", | |
" <td>ENSG00000236251;ENSG00000226704;ENSG0000023425...</td>\n", | |
" <td>P34931</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.631926</th>\n", | |
" <td>Cadherin, EGF LAG seven-pass G-type receptor 3...</td>\n", | |
" <td>3p24.1-p21.2</td>\n", | |
" <td>4.92</td>\n", | |
" <td>CELSR3</td>\n", | |
" <td>1951</td>\n", | |
" <td>ENSG00000008300</td>\n", | |
" <td>Q9NYQ7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.465506</th>\n", | |
" <td>Phosphatidic acid phosphatase type 2C (PPAP2C)</td>\n", | |
" <td>19p13</td>\n", | |
" <td>4.14</td>\n", | |
" <td>PLPP2</td>\n", | |
" <td>8612</td>\n", | |
" <td>ENSG00000141934</td>\n", | |
" <td>O43688</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Hs.432648</th>\n", | |
" <td>heat shock 70 kDa protein 2 (HSPA2)</td>\n", | |
" <td>14q24.1</td>\n", | |
" <td>3.97</td>\n", | |
" <td>HSPA2</td>\n", | |
" <td>3306</td>\n", | |
" <td>ENSG00000126803</td>\n", | |
" <td>P54652</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Protein (Gene) CHR \\\n", | |
"Unigene \n", | |
"Hs.334370 Brain expressed, X-linked 1 (BEX1) Xq22 \n", | |
"Hs.690634 Heat shock 70 kDa protein 1-like (HSPA1L) NaN \n", | |
"Hs.631926 Cadherin, EGF LAG seven-pass G-type receptor 3... 3p24.1-p21.2 \n", | |
"Hs.465506 Phosphatidic acid phosphatase type 2C (PPAP2C) 19p13 \n", | |
"Hs.432648 heat shock 70 kDa protein 2 (HSPA2) 14q24.1 \n", | |
"\n", | |
" Ratio symbol entrez_gene \\\n", | |
"Unigene \n", | |
"Hs.334370 5.55 BEX1 55859 \n", | |
"Hs.690634 5.43 HSPA1L 3305 \n", | |
"Hs.631926 4.92 CELSR3 1951 \n", | |
"Hs.465506 4.14 PLPP2 8612 \n", | |
"Hs.432648 3.97 HSPA2 3306 \n", | |
"\n", | |
" ensembl_gene uniprot \n", | |
"Unigene \n", | |
"Hs.334370 ENSG00000133169 Q9HBH7 \n", | |
"Hs.690634 ENSG00000236251;ENSG00000226704;ENSG0000023425... P34931 \n", | |
"Hs.631926 ENSG00000008300 Q9NYQ7 \n", | |
"Hs.465506 ENSG00000141934 O43688 \n", | |
"Hs.432648 ENSG00000126803 P54652 " | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"augmented_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"augmented_df.to_csv('liu-et-al-abbs-2009-table-3-augmented.tsv', sep='\\t')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"-----\n", | |
"matplotlib 3.3.2\n", | |
"mygene 3.1.0\n", | |
"numpy 1.19.2\n", | |
"pandas 1.1.2\n", | |
"seaborn 0.11.0\n", | |
"sinfo 0.3.1\n", | |
"tqdm 4.50.0\n", | |
"-----\n", | |
"IPython 7.18.1\n", | |
"jupyter_client 6.1.7\n", | |
"jupyter_core 4.6.3\n", | |
"notebook 6.1.4\n", | |
"-----\n", | |
"Python 3.8.5 (default, Jul 21 2020, 10:42:08) [Clang 11.0.0 (clang-1100.0.33.17)]\n", | |
"macOS-10.14.6-x86_64-i386-64bit\n", | |
"8 logical CPU cores, i386\n", | |
"-----\n", | |
"Session information updated at 2020-10-05 20:29\n" | |
] | |
} | |
], | |
"source": [ | |
"from sinfo import sinfo\n", | |
"sinfo(req_file_name='requirements.txt')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
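
The per-hit flattening and the manual curation step in the notebook above can be sketched without any network access. This is a minimal illustration rather than the notebook's own code: `flatten_hit` and `_join` are hypothetical helper names, the sample hits are copied (without scores and TrEMBL lists) from the `Hs.23581` response shown in the notebook output, and a hit with no Swiss-Prot accession is assumed to yield no UniProt ID.

```python
from typing import Any, Dict, List, Optional


def _join(value: Any) -> Optional[str]:
    """Return one string for a scalar or a list of identifiers (hypothetical helper)."""
    if value is None:
        return None
    if isinstance(value, list):
        return ';'.join(str(v) for v in value)
    return str(value)


def flatten_hit(unigene: str, hit: Dict[str, Any]) -> List[Optional[str]]:
    """Turn one hit into [unigene, symbol, entrez_gene, ensembl_gene, uniprot]."""
    ensembl = hit.get('ensembl')
    if isinstance(ensembl, list):
        # Several Ensembl records: concatenate their gene IDs with ';'.
        ensembl_ids = _join([e['gene'] for e in ensembl])
    elif ensembl is not None:
        ensembl_ids = ensembl['gene']
    else:
        ensembl_ids = None

    # Assumption: only Swiss-Prot accessions are kept, TrEMBL ones are ignored.
    swiss_prot = (hit.get('uniprot') or {}).get('Swiss-Prot')
    return [unigene, hit.get('symbol'), hit.get('entrezgene'),
            ensembl_ids, _join(swiss_prot)]


# The two hits returned for Hs.23581, abbreviated from the notebook output.
hits = [
    {'ensembl': {'gene': 'ENSG00000213625'}, 'entrezgene': '54741',
     'symbol': 'LEPROT', 'uniprot': {'Swiss-Prot': 'O15243'}},
    {'ensembl': {'gene': 'ENSG00000116678'}, 'entrezgene': '3953',
     'symbol': 'LEPR', 'uniprot': {'Swiss-Prot': 'P48357', 'TrEMBL': 'Q4G138'}},
]
rows = [flatten_hit('Hs.23581', hit) for hit in hits]

# Manual curation: drop the (Unigene, symbol) pair whose name does not match the paper.
delete_hits = {('Hs.23581', 'LEPROT')}
curated = [row for row in rows if (row[0], row[1]) not in delete_hits]
print(curated)
```

Keeping the curation as an explicit set of `(Unigene, symbol)` pairs, as the notebook does, leaves a readable record of which mappings were rejected by hand.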
Unigene | Protein (Gene) | CHR | Ratio | symbol | entrez_gene | ensembl_gene | uniprot | |
---|---|---|---|---|---|---|---|---|
Hs.334370 | Brain expressed, X-linked 1 (BEX1) | Xq22 | 5.55 | BEX1 | 55859 | ENSG00000133169 | Q9HBH7 | |
Hs.690634 | Heat shock 70 kDa protein 1-like (HSPA1L) | | 5.43 | HSPA1L | 3305 | ENSG00000236251;ENSG00000226704;ENSG00000234258;ENSG00000206383;ENSG00000204390 | P34931 | |
Hs.631926 | Cadherin, EGF LAG seven-pass G-type receptor 3, flamingoDrosophilahomolog (CELSR3) | 3p24.1-p21.2 | 4.92 | CELSR3 | 1951 | ENSG00000008300 | Q9NYQ7 | |
Hs.465506 | Phosphatidic acid phosphatase type 2C (PPAP2C) | 19p13 | 4.14 | PLPP2 | 8612 | ENSG00000141934 | O43688 | |
Hs.432648 | heat shock 70 kDa protein 2 (HSPA2) | 14q24.1 | 3.97 | HSPA2 | 3306 | ENSG00000126803 | P54652 | |
Hs.69089 | galactosidase, _ (GLA) | Xq22 | 3.85 | GLA | 2717 | ENSG00000102393 | P06280 | |
Hs.433750 | eukaryotic translation initiation factor 4 _, 1 (EIF4G1) | 3q27-qter | 3.8 | EIF4G1 | 1981 | ENSG00000114867 | Q04637 | |
Hs.533628 | KIAA0133 | | 3.75 | URB2 | 9816 | ENSG00000135763 | Q14146 | |
Hs.279929 | Transmembrane emp24 protein transport domain containing 9 (TMED9) | 5q35.3 | 3.44 | TMED9 | 54732 | ENSG00000184840 | Q9BVK6 | |
Hs.516874 | chromogranin B (CHGB) | 20pter-p12 | 3.42 | CHGB | 1114 | ENSG00000089199 | P05060 | |
Hs.74565 | amyloid _ precursor-like protein 1 (APLP1) | 19q13.1 | 3.4 | APLP1 | 333 | ENSG00000105290 | P51693 | |
Hs.650382 | RAB5C, member RAS oncogene family (RAB5C) | 17q21.2 | 3.28 | RAB5C | 5878 | ENSG00000108774 | P51148 | |
Hs.23581 | leptin receptor (LEPR) | 1 | 3.25 | LEPR | 3953 | ENSG00000116678 | P48357 | |
Hs.518403 | Phosphatidylinositol glycan anchor biosynthesis class Z (PIGZ) | 3q29 | 3.24 | PIGZ | 80235 | ENSG00000119227 | Q86VD9 | |
Hs.313 | secreted phosphoprotein 1 (SPP1) | 4q21-q25 | 3.12 | SPP1 | 6696 | ENSG00000118785 | P10451 | |
Hs.232618 | secretogranin III (SCG3) | 15q21 | 2.87 | SCG3 | 29106 | ENSG00000104112 | Q8WXD2 | |
Hs.47166 | Chromosome 3 open reading frame 14 (C3orf14) | 3p14.3 | 2.86 | C3orf14 | 57415 | ENSG00000114405 | Q9HBI5 | |
Hs.514527 | baculoviral IAP repeat-containing 5 (BIRC5) | 17q25 | 2.86 | | | | | |
Hs.530003 | solute carrier family 2 member 5 (SLC2A5) | 1p36.2 | 2.84 | SLC2A5 | 6518 | ENSG00000142583 | P22732 | |
Hs.496068 | PCTAIRE protein kinase 1 (PCTK1) | xp11.3-p11.23 | 2.84 | CDK16 | 5127 | ENSG00000102225 | Q00536 | |
Hs.632642 | phosphoglycerate mutase 2 (muscle) (PGAM2) | 7p13-p12 | 2.79 | PGAM2 | 5224 | ENSG00000164708 | P15259 | |
Hs.502842 | calpain 1, (mu/I) large subunit (CAPN1) | 11q13 | 2.71 | CAPN1 | 823 | ENSG00000014216 | P07384 | |
Hs.655168 | Sideroflexin 4 (SFXN4) | 10 | 2.61 | SFXN4 | 119559 | ENSG00000183605 | Q6P4A7 | |
Hs.645248 | phosphate cytidylyltransferase 2 (PCYT2) | 17q25.3 | 2.58 | | | | | |
Hs.465985 | arsA (bacterial) arsenite transporter homolog 1 (ASNA1) | 19q13.3 | 2.51 | GET3 | 439 | ENSG00000198356 | O43681 | |
Hs.527295 | ectonucleotide pyrophosphatase/phosphodiesterase 1 (ENPP1) | 6q22-q23 | 2.5 | ENPP1 | 5167 | ENSG00000197594 | P22413 | |
Hs.5148 | TRAF-type zinc finger domain containing 1 (TRAFD1) | 12q | 2.49 | TRAFD1 | 10906 | ENSG00000135148 | O14545 | |
Hs.194754 | Chromosome 1 open reading frame 107 (C1orf107) | 1q32.3-q41 | 2.42 | UTP25 | 27042 | ENSG00000117597 | Q68CQ4 | |
Hs.90303 | tuberous sclerosis 2 (TSC2) | 16p13.3 | 2.4 | TSC2 | 7249 | ENSG00000103197 | P49815 | |
Hs.631655 | Leprecan-like 2 (LEPREL2) | 12q13 | 2.35 | P3H3 | 10536 | ENSG00000110811 | Q8IVL6 | |
Hs.274402 | heat shock 70 kDa protein 1B (HSPA1B) | 6p21.3 | 2.34 | HSPA1B | 3304 | ENSG00000232804;ENSG00000204388;ENSG00000231555;ENSG00000212866;ENSG00000224501 | P0DMV8;P0DMV9 | |
Hs.643454 | dystrobrevin, _ #DTNA#, transcript variant DTN2 (DTNA) | 18q12 | 2.34 | DTNA | 1837 | ENSG00000134769 | Q9Y4J8 | |
Hs.500375 | ectonucleoside triphosphate diphosphohydrolase 6 (putative) (ENTPD6) | | 2.33 | ENTPD6 | 955 | ENSG00000197586 | O75354 | |
Hs.500916 | internexin neuronal intermediate filament protein, alpha (INA) | 10q25.1 | 2.32 | INA | 9118 | ENSG00000148798 | Q16352 | |
Hs.654401 | IMP (inosine monophosphate) dehydrogenase 1 (IMPDH1) | 7q31.3-q32 | 2.32 | IMPDH1 | 3614 | ENSG00000106348 | P20839 | |
Hs.631992 | Dystonin (DST) | | 2.31 | | | | | |
Hs.135270 | collapsin response mediator protein 1 (CRMP1) | 4p16.1-p15 | 2.3 | CRMP1 | 1400 | ENSG00000072832 | Q14194 | |
Hs.524947 | CDC20 cell division cycle 20S. cerevisiaehomolog (CDC20) | 9q13-q21 | 2.28 | CDC20 | 991 | ENSG00000117399 | Q12834 | |
Hs.651212 | Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, _ polypeptide (YWHAB) | | 2.27 | | | | | |
Hs.500197 | RaP2 interacting protein 8 (RPIP8) | 17q21.31 | 2.26 | RUNDC3A | 10900 | ENSG00000108309 | Q59EK9 | |
Hs.533772 | Meteorin, glial cell differentiation regulator (METRN) | 16 | 2.24 | METRN | 79006 | ENSG00000103260 | Q9UJH8 | |
Hs.576875 | DEAD/H box polypeptide 21 (DDX21) | 10q21 | 2.22 | | | | | |
Hs.435255 | UBX domain-containing 1 (UBXD1) | 19p13 | 2.22 | UBXN6 | 80700 | ENSG00000167671 | Q9BZV1 | |
Hs.464391 | tubulin-specific chaperone d (TBCD) | 17 | 2.21 | TBCD | 6904 | ENSG00000278759;ENSG00000141556 | Q9BTW9 | |
Hs.479439 | BH-protocadherin (brain-heart) (PCDH7) | 4p15 | 2.19 | PCDH7 | 5099 | ENSG00000169851 | O60245 | |
Hs.591234 | gap junction protein, _2 (GJB2) | 13q11-q12 | 2.18 | | | | | |
Hs.528572 | vinexin _ (SH3-containing adaptor molecule-1) (SORBS3) | 8p21.2 | 2.17 | SORBS3 | 10174 | ENSG00000120896 | O60504 | |
Hs.464336 | procollagen-proline, 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase), _ polypeptide (P4HB) | 17q25 | 2.16 | P4HB | 5034 | ENSG00000185624 | P07237 | |
Hs.9194 | UBA domain containing 1 (UBAC1) | 9q34.3 | 2.11 | UBAC1 | 10422 | ENSG00000130560 | Q9BSL1 | |
Hs.518414 | Leucine-rich repeat and calponin homology domain containing 3 (LRCH3) | | 2.1 | LRCH3 | 84859 | ENSG00000186001 | Q96II8 | |
Hs.520028 | heat shock 70 kDa protein 1A (HSPA1A) | 6p21.3 | 2.09 | | | | | |
Hs.808 | heterogeneous nuclear ribonucleoprotein F (HNRPF) | | 2.08 | HNRNPF | 3185 | ENSG00000169813 | P52597 | |
Hs.23413 | ATAD3B220 sequences | 1p36.32 | 2.08 | ATAD3A | 55210 | ENSG00000197785 | Q9NVI7 | |
Hs.355934 | splicing factor proline/glutamine rich | 1pter-p32.3 | 2.06 | SFPQ | 6421 | ENSG00000116560 | P23246 |
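
Because multiple Ensembl or UniProt IDs are joined with a semicolon and unmapped genes have empty cells, downstream code has to split and filter these columns. The sketch below shows one way to do that with the standard library only; the three rows are an excerpt of the augmented table above rebuilt as tab-separated text, and `split_ids` is a hypothetical helper name, not part of the notebook.

```python
import csv
import io

# Three rows from the augmented table, rebuilt here as TSV text.
header = ['Unigene', 'Protein (Gene)', 'CHR', 'Ratio',
          'symbol', 'entrez_gene', 'ensembl_gene', 'uniprot']
excerpt = [
    ['Hs.334370', 'Brain expressed, X-linked 1 (BEX1)', 'Xq22', '5.55',
     'BEX1', '55859', 'ENSG00000133169', 'Q9HBH7'],
    ['Hs.514527', 'baculoviral IAP repeat-containing 5 (BIRC5)', '17q25', '2.86',
     '', '', '', ''],
    ['Hs.464391', 'tubulin-specific chaperone d (TBCD)', '17', '2.21',
     'TBCD', '6904', 'ENSG00000278759;ENSG00000141556', 'Q9BTW9'],
]
tsv_text = '\n'.join('\t'.join(row) for row in [header] + excerpt)


def split_ids(cell):
    """Split a semicolon-joined cell into a list; an empty cell gives []."""
    return [part for part in cell.split(';') if part]


ensembl_by_unigene = {}
unmapped = []
for row in csv.DictReader(io.StringIO(tsv_text), delimiter='\t'):
    ensembl_by_unigene[row['Unigene']] = split_ids(row['ensembl_gene'])
    # Rows with no symbol are the genes for which the lookup returned no hit.
    if not row['symbol']:
        unmapped.append(row['Unigene'])

print(ensembl_by_unigene)
print(unmapped)
```

With pandas, which the notebook already uses, the same split could be done on the real file after reading it with `sep='\t'`; the standard-library version is shown only to keep the example free of dependencies.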
Unigene | Protein (Gene) | CHR | Ratio | |
---|---|---|---|---|
Hs.334370 | Brain expressed, X-linked 1 (BEX1) | Xq22 | 5.55 | |
Hs.690634 | Heat shock 70 kDa protein 1-like (HSPA1L) | | 5.43 | |
Hs.631926 | Cadherin, EGF LAG seven-pass G-type receptor 3, flamingoDrosophilahomolog (CELSR3) | 3p24.1-p21.2 | 4.92 | |
Hs.465506 | Phosphatidic acid phosphatase type 2C (PPAP2C) | 19p13 | 4.14 | |
Hs.432648 | heat shock 70 kDa protein 2 (HSPA2) | 14q24.1 | 3.97 | |
Hs.69089 | galactosidase, _ (GLA) | Xq22 | 3.85 | |
Hs.433750 | eukaryotic translation initiation factor 4 _, 1 (EIF4G1) | 3q27-qter | 3.80 | |
Hs.533628 | KIAA0133 | | 3.75 | |
Hs.279929 | Transmembrane emp24 protein transport domain containing 9 (TMED9) | 5q35.3 | 3.44 | |
Hs.516874 | chromogranin B (CHGB) | 20pter-p12 | 3.42 | |
Hs.74565 | amyloid _ precursor-like protein 1 (APLP1) | 19q13.1 | 3.40 | |
Hs.650382 | RAB5C, member RAS oncogene family (RAB5C) | 17q21.2 | 3.28 | |
Hs.23581 | leptin receptor (LEPR) | 1 | 3.25 | |
Hs.518403 | Phosphatidylinositol glycan anchor biosynthesis class Z (PIGZ) | 3q29 | 3.24 | |
Hs.313 | secreted phosphoprotein 1 (SPP1) | 4q21-q25 | 3.12 | |
Hs.232618 | secretogranin III (SCG3) | 15q21 | 2.87 | |
Hs.47166 | Chromosome 3 open reading frame 14 (C3orf14) | 3p14.3 | 2.86 | |
Hs.514527 | baculoviral IAP repeat-containing 5 (BIRC5) | 17q25 | 2.86 | |
Hs.530003 | solute carrier family 2 member 5 (SLC2A5) | 1p36.2 | 2.84 | |
Hs.496068 | PCTAIRE protein kinase 1 (PCTK1) | xp11.3-p11.23 | 2.84 | |
Hs.632642 | phosphoglycerate mutase 2 (muscle) (PGAM2) | 7p13-p12 | 2.79 | |
Hs.502842 | calpain 1, (mu/I) large subunit (CAPN1) | 11q13 | 2.71 | |
Hs.655168 | Sideroflexin 4 (SFXN4) | 10 | 2.61 | |
Hs.645248 | phosphate cytidylyltransferase 2 (PCYT2) | 17q25.3 | 2.58 | |
Hs.465985 | arsA (bacterial) arsenite transporter homolog 1 (ASNA1) | 19q13.3 | 2.51 | |
Hs.527295 | ectonucleotide pyrophosphatase/phosphodiesterase 1 (ENPP1) | 6q22-q23 | 2.50 | |
Hs.5148 | TRAF-type zinc finger domain containing 1 (TRAFD1) | 12q | 2.49 | |
Hs.194754 | Chromosome 1 open reading frame 107 (C1orf107) | 1q32.3-q41 | 2.42 | |
Hs.90303 | tuberous sclerosis 2 (TSC2) | 16p13.3 | 2.40 | |
Hs.631655 | Leprecan-like 2 (LEPREL2) | 12q13 | 2.35 | |
Hs.274402 | heat shock 70 kDa protein 1B (HSPA1B) | 6p21.3 | 2.34 | |
Hs.643454 | dystrobrevin, _ #DTNA#, transcript variant DTN2 (DTNA) | 18q12 | 2.34 | |
Hs.500375 | ectonucleoside triphosphate diphosphohydrolase 6 (putative) (ENTPD6) | | 2.33 | |
Hs.500916 | internexin neuronal intermediate filament protein, alpha (INA) | 10q25.1 | 2.32 | |
Hs.654401 | IMP (inosine monophosphate) dehydrogenase 1 (IMPDH1) | 7q31.3-q32 | 2.32 | |
Hs.631992 | Dystonin (DST) | | 2.31 | |
Hs.135270 | collapsin response mediator protein 1 (CRMP1) | 4p16.1-p15 | 2.30 | |
Hs.524947 | CDC20 cell division cycle 20S. cerevisiaehomolog (CDC20) | 9q13-q21 | 2.28 | |
Hs.651212 | Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, _ polypeptide (YWHAB) | | 2.27 | |
Hs.500197 | RaP2 interacting protein 8 (RPIP8) | 17q21.31 | 2.26 | |
Hs.533772 | Meteorin, glial cell differentiation regulator (METRN) | 16 | 2.24 | |
Hs.576875 | DEAD/H box polypeptide 21 (DDX21) | 10q21 | 2.22 | |
Hs.435255 | UBX domain-containing 1 (UBXD1) | 19p13 | 2.22 | |
Hs.464391 | tubulin-specific chaperone d (TBCD) | 17 | 2.21 | |
Hs.479439 | BH-protocadherin (brain-heart) (PCDH7) | 4p15 | 2.19 | |
Hs.591234 | gap junction protein, _2 (GJB2) | 13q11-q12 | 2.18 | |
Hs.528572 | vinexin _ (SH3-containing adaptor molecule-1) (SORBS3) | 8p21.2 | 2.17 | |
Hs.464336 | procollagen-proline, 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase), _ polypeptide (P4HB) | 17q25 | 2.16 | |
Hs.9194 | UBA domain containing 1 (UBAC1) | 9q34.3 | 2.11 | |
Hs.518414 | Leucine-rich repeat and calponin homology domain containing 3 (LRCH3) | | 2.10 | |
Hs.520028 | heat shock 70 kDa protein 1A (HSPA1A) | 6p21.3 | 2.09 | |
Hs.808 | heterogeneous nuclear ribonucleoprotein F (HNRPF) | | 2.08 | |
Hs.23413 | ATAD3B220 sequences | 1p36.32 | 2.08 | |
Hs.355934 | splicing factor proline/glutamine rich | 1pter-p32.3 | 2.06 |
matplotlib==3.3.2 | |
mygene==3.1.0 | |
numpy==1.19.2 | |
pandas==1.1.2 | |
seaborn==0.11.0 | |
sinfo==0.3.1 | |
tqdm==4.50.0 | |
IPython==7.18.1 | |
jupyter_client==6.1.7 | |
jupyter_core==4.6.3 | |
notebook==6.1.4 |