@lukauskas
Created October 5, 2020 18:40
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Liu et al, ABBS, 2009 - Table 3\n",
"\n",
"This notebook augments Table 3 of (Liu, 2009), which lists genes upregulated in HEK293 cells upon depletion of REST. Specifically, we add Ensembl, UniProt and Entrez Gene identifiers to the existing table; the identifiers were fetched from the [mygene.info](https://mygene.info) API (Xin, 2016).\n",
"\n",
"Where a UniGene identifier mapped to more than one gene, the data were manually curated to keep only the mapping whose symbol matches the name in the publication. Where one UniGene ID mapped to multiple Ensembl/UniProt IDs, the IDs were concatenated with a semicolon (`;`).\n",
"\n",
"References:\n",
"\n",
"* (Liu, 2009): Liu, Z., Liu, M., Niu, G., Cheng, Y., and Fei, J. (2009). Genome-wide identification of target genes repressed by the zinc finger transcription factor REST/NRSF in the HEK 293 cell line. Acta Bioch Bioph Sin 41, 1008–1017.\n",
"* (Xin, 2016): Xin, J., Mark, A., Afrasiabi, C., Tsueng, G., Juchler, M., Gopal, N., Stupp, G., Putman, T., Ainscough, B., Griffith, O., Torkamani, A., Whetzel, P., Mungall, C., Mooney, S., Su, A., Wu, C. (2016). High-performance web services for querying gene and variant annotation. Genome Biology 17(1), 91. https://dx.doi.org/10.1186/s13059-016-0953-9"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%config InlineBackend.figure_format = 'retina'\n",
"%matplotlib inline\n",
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"from matplotlib import pyplot as plt\n",
"from tqdm import tqdm\n",
"sns.set_palette('Dark2')\n",
"sns.set_style({'axes.axisbelow': True, 'axes.edgecolor': '.15', 'axes.facecolor': 'white',\n",
" 'axes.grid': True, 'axes.labelcolor': '.15', 'axes.linewidth': 1.25, \n",
" 'figure.facecolor': 'white', 'font.family': ['sans-serif'], 'grid.color': '.15',\n",
" 'grid.linestyle': ':', 'grid.alpha': .5, 'image.cmap': 'Greys', \n",
" 'legend.frameon': False, 'legend.numpoints': 1, 'legend.scatterpoints': 1,\n",
" 'lines.solid_capstyle': 'round', 'axes.spines.right': False, 'axes.spines.top': False, \n",
" 'text.color': '.15', 'xtick.top': False, 'ytick.right': False, 'xtick.color': '.15',\n",
" 'xtick.direction': 'out', 'xtick.major.size': 6, 'xtick.minor.size': 3,\n",
" 'ytick.color': '.15', 'ytick.direction': 'out', 'ytick.major.size': 6,'ytick.minor.size': 3})\n",
"sns.set_context('talk')\n",
"\n",
"#http://phyletica.org/matplotlib-fonts/\n",
"import matplotlib\n",
"matplotlib.rcParams['pdf.fonttype'] = 42\n",
"matplotlib.rcParams['ps.fonttype'] = 42"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('liu-et-al-abbs-2009-table-3.tsv', sep='\\t')\n",
"data = data.set_index('Unigene')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Protein (Gene)</th>\n",
" <th>CHR</th>\n",
" <th>Ratio</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Unigene</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Hs.334370</th>\n",
" <td>Brain expressed, X-linked 1 (BEX1)</td>\n",
" <td>Xq22</td>\n",
" <td>5.55</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.690634</th>\n",
" <td>Heat shock 70 kDa protein 1-like (HSPA1L)</td>\n",
" <td>NaN</td>\n",
" <td>5.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.631926</th>\n",
" <td>Cadherin, EGF LAG seven-pass G-type receptor 3...</td>\n",
" <td>3p24.1-p21.2</td>\n",
" <td>4.92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.465506</th>\n",
" <td>Phosphatidic acid phosphatase type 2C (PPAP2C)</td>\n",
" <td>19p13</td>\n",
" <td>4.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.432648</th>\n",
" <td>heat shock 70 kDa protein 2 (HSPA2)</td>\n",
" <td>14q24.1</td>\n",
" <td>3.97</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Protein (Gene) CHR \\\n",
"Unigene \n",
"Hs.334370 Brain expressed, X-linked 1 (BEX1) Xq22 \n",
"Hs.690634 Heat shock 70 kDa protein 1-like (HSPA1L) NaN \n",
"Hs.631926 Cadherin, EGF LAG seven-pass G-type receptor 3... 3p24.1-p21.2 \n",
"Hs.465506 Phosphatidic acid phosphatase type 2C (PPAP2C) 19p13 \n",
"Hs.432648 heat shock 70 kDa protein 2 (HSPA2) 14q24.1 \n",
"\n",
" Ratio \n",
"Unigene \n",
"Hs.334370 5.55 \n",
"Hs.690634 5.43 \n",
"Hs.631926 4.92 \n",
"Hs.465506 4.14 \n",
"Hs.432648 3.97 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Making sure there are 54 genes:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"assert len(data) == 54"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import mygene\n",
"mg = mygene.MyGeneInfo()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 54/54 [00:21<00:00, 2.53it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"One-to-many hits:\n",
"['Hs.23581', 'Hs.274402', 'Hs.464391']\n",
"Data was downloaded from MyGene at 2020-10-05T20:29:34.374934\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from datetime import datetime\n",
"\n",
"\n",
"mapping = []\n",
"\n",
"multihits = []\n",
"\n",
"for query in tqdm(data.index):\n",
"    # For some reason querymany fails, so query one UniGene ID at a time\n",
"    response = mg.query(f'unigene:{query}',\n",
"                        fields=['ensembl.gene', 'entrezgene', 'symbol', 'uniprot', 'unigene'])\n",
"\n",
"    hits = response['hits']\n",
"    if len(hits) > 1:\n",
"        multihits.append(query)\n",
"\n",
"    for hit in hits:\n",
"        # 'ensembl' is either absent, a single dict, or a list of dicts\n",
"        if 'ensembl' not in hit:\n",
"            ensembl = None\n",
"        elif isinstance(hit['ensembl'], list):\n",
"            ensembl = ';'.join([e['gene'] for e in hit['ensembl']])\n",
"        else:\n",
"            ensembl = hit['ensembl']['gene']\n",
"\n",
"        # A hit may carry only TrEMBL accessions, so look up Swiss-Prot defensively\n",
"        swissprot = hit.get('uniprot', {}).get('Swiss-Prot')\n",
"        if isinstance(swissprot, list):\n",
"            uniprot = ';'.join(swissprot)\n",
"        else:\n",
"            uniprot = swissprot\n",
"\n",
"        mapping.append([query,\n",
"                        hit.get('symbol'), hit.get('entrezgene'),\n",
"                        ensembl,\n",
"                        uniprot])\n",
"\n",
"mapping = pd.DataFrame(mapping, columns=['Unigene', 'symbol', 'entrez_gene', 'ensembl_gene', 'uniprot'])\n",
"mapping = mapping.set_index(['Unigene', 'symbol'])\n",
" \n",
"print(\"One-to-many hits:\")\n",
"print(multihits)\n",
"\n",
"print('Data was downloaded from MyGene at {}'.format(datetime.now().isoformat()))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'took': 3,\n",
" 'total': 2,\n",
" 'max_score': 18.433449,\n",
" 'hits': [{'_id': '54741',\n",
" '_score': 18.433449,\n",
" 'ensembl': {'gene': 'ENSG00000213625'},\n",
" 'entrezgene': '54741',\n",
" 'symbol': 'LEPROT',\n",
" 'unigene': ['Hs.23581', 'Hs.258228'],\n",
" 'uniprot': {'Swiss-Prot': 'O15243',\n",
" 'TrEMBL': ['A0A087X0N2', 'A0A3B3ISI6', 'A0A3B3ISI8', 'A0A3B3ITV1']}},\n",
" {'_id': '3953',\n",
" '_score': 17.640072,\n",
" 'ensembl': {'gene': 'ENSG00000116678'},\n",
" 'entrezgene': '3953',\n",
" 'symbol': 'LEPR',\n",
" 'unigene': ['Hs.23581', 'Hs.723178'],\n",
" 'uniprot': {'Swiss-Prot': 'P48357', 'TrEMBL': 'Q4G138'}}]}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mg.query('unigene:Hs.23581', fields=['ensembl.gene', 'entrezgene', 'symbol', 'uniprot', 'unigene'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some manual curation of one-to-many hits based on gene names:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>entrez_gene</th>\n",
" <th>ensembl_gene</th>\n",
" <th>uniprot</th>\n",
" <th>Protein (Gene)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Unigene</th>\n",
" <th>symbol</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Hs.23581</th>\n",
" <th>LEPROT</th>\n",
" <td>54741</td>\n",
" <td>ENSG00000213625</td>\n",
" <td>O15243</td>\n",
" <td>leptin receptor (LEPR)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>LEPR</th>\n",
" <td>3953</td>\n",
" <td>ENSG00000116678</td>\n",
" <td>P48357</td>\n",
" <td>leptin receptor (LEPR)</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Hs.274402</th>\n",
" <th>HSPA1A</th>\n",
" <td>3303</td>\n",
" <td>ENSG00000237724;ENSG00000204389;ENSG0000021532...</td>\n",
" <td>P0DMV8;P0DMV9</td>\n",
" <td>heat shock 70 kDa protein 1B (HSPA1B)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>HSPA1B</th>\n",
" <td>3304</td>\n",
" <td>ENSG00000232804;ENSG00000204388;ENSG0000023155...</td>\n",
" <td>P0DMV8;P0DMV9</td>\n",
" <td>heat shock 70 kDa protein 1B (HSPA1B)</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">Hs.464391</th>\n",
" <th>TBCD</th>\n",
" <td>6904</td>\n",
" <td>ENSG00000278759;ENSG00000141556</td>\n",
" <td>Q9BTW9</td>\n",
" <td>tubulin-specific chaperone d (TBCD)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ZNF750</th>\n",
" <td>79755</td>\n",
" <td>ENSG00000141579</td>\n",
" <td>Q32MQ0</td>\n",
" <td>tubulin-specific chaperone d (TBCD)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" entrez_gene \\\n",
"Unigene symbol \n",
"Hs.23581 LEPROT 54741 \n",
" LEPR 3953 \n",
"Hs.274402 HSPA1A 3303 \n",
" HSPA1B 3304 \n",
"Hs.464391 TBCD 6904 \n",
" ZNF750 79755 \n",
"\n",
" ensembl_gene \\\n",
"Unigene symbol \n",
"Hs.23581 LEPROT ENSG00000213625 \n",
" LEPR ENSG00000116678 \n",
"Hs.274402 HSPA1A ENSG00000237724;ENSG00000204389;ENSG0000021532... \n",
" HSPA1B ENSG00000232804;ENSG00000204388;ENSG0000023155... \n",
"Hs.464391 TBCD ENSG00000278759;ENSG00000141556 \n",
" ZNF750 ENSG00000141579 \n",
"\n",
" uniprot Protein (Gene) \n",
"Unigene symbol \n",
"Hs.23581 LEPROT O15243 leptin receptor (LEPR) \n",
" LEPR P48357 leptin receptor (LEPR) \n",
"Hs.274402 HSPA1A P0DMV8;P0DMV9 heat shock 70 kDa protein 1B (HSPA1B) \n",
" HSPA1B P0DMV8;P0DMV9 heat shock 70 kDa protein 1B (HSPA1B) \n",
"Hs.464391 TBCD Q9BTW9 tubulin-specific chaperone d (TBCD) \n",
" ZNF750 Q32MQ0 tubulin-specific chaperone d (TBCD) "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mapping.loc[multihits].join(data['Protein (Gene)'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"delete_hits = {\n",
" ('Hs.23581', 'LEPROT'),\n",
" ('Hs.274402', 'HSPA1A'),\n",
" ('Hs.464391', 'ZNF750')\n",
"}\n",
"\n",
"mapping = mapping.drop(delete_hits)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>entrez_gene</th>\n",
" <th>ensembl_gene</th>\n",
" <th>uniprot</th>\n",
" <th>Protein (Gene)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Unigene</th>\n",
" <th>symbol</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Hs.23581</th>\n",
" <th>LEPR</th>\n",
" <td>3953</td>\n",
" <td>ENSG00000116678</td>\n",
" <td>P48357</td>\n",
" <td>leptin receptor (LEPR)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.274402</th>\n",
" <th>HSPA1B</th>\n",
" <td>3304</td>\n",
" <td>ENSG00000232804;ENSG00000204388;ENSG0000023155...</td>\n",
" <td>P0DMV8;P0DMV9</td>\n",
" <td>heat shock 70 kDa protein 1B (HSPA1B)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.464391</th>\n",
" <th>TBCD</th>\n",
" <td>6904</td>\n",
" <td>ENSG00000278759;ENSG00000141556</td>\n",
" <td>Q9BTW9</td>\n",
" <td>tubulin-specific chaperone d (TBCD)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" entrez_gene \\\n",
"Unigene symbol \n",
"Hs.23581 LEPR 3953 \n",
"Hs.274402 HSPA1B 3304 \n",
"Hs.464391 TBCD 6904 \n",
"\n",
" ensembl_gene \\\n",
"Unigene symbol \n",
"Hs.23581 LEPR ENSG00000116678 \n",
"Hs.274402 HSPA1B ENSG00000232804;ENSG00000204388;ENSG0000023155... \n",
"Hs.464391 TBCD ENSG00000278759;ENSG00000141556 \n",
"\n",
" uniprot Protein (Gene) \n",
"Unigene symbol \n",
"Hs.23581 LEPR P48357 leptin receptor (LEPR) \n",
"Hs.274402 HSPA1B P0DMV8;P0DMV9 heat shock 70 kDa protein 1B (HSPA1B) \n",
"Hs.464391 TBCD Q9BTW9 tubulin-specific chaperone d (TBCD) "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mapping.loc[multihits].join(data['Protein (Gene)'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create augmented DF:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"augmented_df = pd.merge(data.reset_index(), \n",
" mapping.reset_index(),\n",
" left_on='Unigene', \n",
" right_on='Unigene',\n",
" how='left').set_index('Unigene')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Protein (Gene)</th>\n",
" <th>CHR</th>\n",
" <th>Ratio</th>\n",
" <th>symbol</th>\n",
" <th>entrez_gene</th>\n",
" <th>ensembl_gene</th>\n",
" <th>uniprot</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Unigene</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Hs.334370</th>\n",
" <td>Brain expressed, X-linked 1 (BEX1)</td>\n",
" <td>Xq22</td>\n",
" <td>5.55</td>\n",
" <td>BEX1</td>\n",
" <td>55859</td>\n",
" <td>ENSG00000133169</td>\n",
" <td>Q9HBH7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.690634</th>\n",
" <td>Heat shock 70 kDa protein 1-like (HSPA1L)</td>\n",
" <td>NaN</td>\n",
" <td>5.43</td>\n",
" <td>HSPA1L</td>\n",
" <td>3305</td>\n",
" <td>ENSG00000236251;ENSG00000226704;ENSG0000023425...</td>\n",
" <td>P34931</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.631926</th>\n",
" <td>Cadherin, EGF LAG seven-pass G-type receptor 3...</td>\n",
" <td>3p24.1-p21.2</td>\n",
" <td>4.92</td>\n",
" <td>CELSR3</td>\n",
" <td>1951</td>\n",
" <td>ENSG00000008300</td>\n",
" <td>Q9NYQ7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.465506</th>\n",
" <td>Phosphatidic acid phosphatase type 2C (PPAP2C)</td>\n",
" <td>19p13</td>\n",
" <td>4.14</td>\n",
" <td>PLPP2</td>\n",
" <td>8612</td>\n",
" <td>ENSG00000141934</td>\n",
" <td>O43688</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hs.432648</th>\n",
" <td>heat shock 70 kDa protein 2 (HSPA2)</td>\n",
" <td>14q24.1</td>\n",
" <td>3.97</td>\n",
" <td>HSPA2</td>\n",
" <td>3306</td>\n",
" <td>ENSG00000126803</td>\n",
" <td>P54652</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Protein (Gene) CHR \\\n",
"Unigene \n",
"Hs.334370 Brain expressed, X-linked 1 (BEX1) Xq22 \n",
"Hs.690634 Heat shock 70 kDa protein 1-like (HSPA1L) NaN \n",
"Hs.631926 Cadherin, EGF LAG seven-pass G-type receptor 3... 3p24.1-p21.2 \n",
"Hs.465506 Phosphatidic acid phosphatase type 2C (PPAP2C) 19p13 \n",
"Hs.432648 heat shock 70 kDa protein 2 (HSPA2) 14q24.1 \n",
"\n",
" Ratio symbol entrez_gene \\\n",
"Unigene \n",
"Hs.334370 5.55 BEX1 55859 \n",
"Hs.690634 5.43 HSPA1L 3305 \n",
"Hs.631926 4.92 CELSR3 1951 \n",
"Hs.465506 4.14 PLPP2 8612 \n",
"Hs.432648 3.97 HSPA2 3306 \n",
"\n",
" ensembl_gene uniprot \n",
"Unigene \n",
"Hs.334370 ENSG00000133169 Q9HBH7 \n",
"Hs.690634 ENSG00000236251;ENSG00000226704;ENSG0000023425... P34931 \n",
"Hs.631926 ENSG00000008300 Q9NYQ7 \n",
"Hs.465506 ENSG00000141934 O43688 \n",
"Hs.432648 ENSG00000126803 P54652 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"augmented_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"augmented_df.to_csv('liu-et-al-abbs-2009-table-3-augmented.tsv', sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-----\n",
"matplotlib 3.3.2\n",
"mygene 3.1.0\n",
"numpy 1.19.2\n",
"pandas 1.1.2\n",
"seaborn 0.11.0\n",
"sinfo 0.3.1\n",
"tqdm 4.50.0\n",
"-----\n",
"IPython 7.18.1\n",
"jupyter_client 6.1.7\n",
"jupyter_core 4.6.3\n",
"notebook 6.1.4\n",
"-----\n",
"Python 3.8.5 (default, Jul 21 2020, 10:42:08) [Clang 11.0.0 (clang-1100.0.33.17)]\n",
"macOS-10.14.6-x86_64-i386-64bit\n",
"8 logical CPU cores, i386\n",
"-----\n",
"Session information updated at 2020-10-05 20:29\n"
]
}
],
"source": [
"from sinfo import sinfo\n",
"sinfo(req_file_name='requirements.txt')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
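The query loop in the notebook branches on whether the `ensembl` and `uniprot` fields of a hit are missing, a single value, or a list. A minimal sketch of the same flattening as two small functions is given below. `join_ids` and `flatten_hit` are hypothetical helper names (they are not part of the notebook or of the `mygene` package), and the `GENE_X` record is invented purely to exercise the TrEMBL-only case; the two LEPROT/LEPR hits are copied from the `Hs.23581` response shown in the notebook.

```python
def join_ids(value, key=None):
    """Return a ';'-joined string for a scalar, a list, or a missing field."""
    if value is None:
        return None
    items = value if isinstance(value, list) else [value]
    if key is not None:
        # Each item is a dict such as {'gene': 'ENSG...'}; pull out one field.
        items = [item[key] for item in items]
    return ';'.join(str(item) for item in items)


def flatten_hit(unigene_id, hit):
    """Flatten one MyGene hit into the row layout used by the notebook."""
    return [
        unigene_id,
        hit.get('symbol'),
        hit.get('entrezgene'),
        join_ids(hit.get('ensembl'), key='gene'),
        # Swiss-Prot may be absent even when a 'uniprot' entry exists.
        join_ids(hit.get('uniprot', {}).get('Swiss-Prot')),
    ]


# Hits copied from the Hs.23581 response in the notebook (TrEMBL lists trimmed).
example_hits = [
    {'symbol': 'LEPROT', 'entrezgene': '54741',
     'ensembl': {'gene': 'ENSG00000213625'},
     'uniprot': {'Swiss-Prot': 'O15243'}},
    {'symbol': 'LEPR', 'entrezgene': '3953',
     'ensembl': {'gene': 'ENSG00000116678'},
     'uniprot': {'Swiss-Prot': 'P48357', 'TrEMBL': 'Q4G138'}},
]
rows = [flatten_hit('Hs.23581', hit) for hit in example_hits]

# Hypothetical hit (made-up symbol and accessions): several Ensembl genes,
# TrEMBL accessions only, so the Swiss-Prot column should come back empty.
trembl_only = {
    'symbol': 'GENE_X', 'entrezgene': '0',
    'ensembl': [{'gene': 'ENSG_A'}, {'gene': 'ENSG_B'}],
    'uniprot': {'TrEMBL': ['X0X0X0']},
}
trembl_row = flatten_hit('Hs.0', trembl_only)
```

Keeping the flattening in a pure function means it can be checked against a saved response without calling the API, and `rows` can be passed to `pd.DataFrame` with the same column names the notebook uses.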
Unigene Protein (Gene) CHR Ratio symbol entrez_gene ensembl_gene uniprot
Hs.334370 Brain expressed, X-linked 1 (BEX1) Xq22 5.55 BEX1 55859 ENSG00000133169 Q9HBH7
Hs.690634 Heat shock 70 kDa protein 1-like (HSPA1L) 5.43 HSPA1L 3305 ENSG00000236251;ENSG00000226704;ENSG00000234258;ENSG00000206383;ENSG00000204390 P34931
Hs.631926 Cadherin, EGF LAG seven-pass G-type receptor 3, flamingoDrosophilahomolog (CELSR3) 3p24.1-p21.2 4.92 CELSR3 1951 ENSG00000008300 Q9NYQ7
Hs.465506 Phosphatidic acid phosphatase type 2C (PPAP2C) 19p13 4.14 PLPP2 8612 ENSG00000141934 O43688
Hs.432648 heat shock 70 kDa protein 2 (HSPA2) 14q24.1 3.97 HSPA2 3306 ENSG00000126803 P54652
Hs.69089 galactosidase, _ (GLA) Xq22 3.85 GLA 2717 ENSG00000102393 P06280
Hs.433750 eukaryotic translation initiation factor 4 _, 1 (EIF4G1) 3q27-qter 3.8 EIF4G1 1981 ENSG00000114867 Q04637
Hs.533628 KIAA0133 3.75 URB2 9816 ENSG00000135763 Q14146
Hs.279929 Transmembrane emp24 protein transport domain containing 9 (TMED9) 5q35.3 3.44 TMED9 54732 ENSG00000184840 Q9BVK6
Hs.516874 chromogranin B (CHGB) 20pter-p12 3.42 CHGB 1114 ENSG00000089199 P05060
Hs.74565 amyloid _ precursor-like protein 1 (APLP1) 19q13.1 3.4 APLP1 333 ENSG00000105290 P51693
Hs.650382 RAB5C, member RAS oncogene family (RAB5C) 17q21.2 3.28 RAB5C 5878 ENSG00000108774 P51148
Hs.23581 leptin receptor (LEPR) 1 3.25 LEPR 3953 ENSG00000116678 P48357
Hs.518403 Phosphatidylinositol glycan anchor biosynthesis class Z (PIGZ) 3q29 3.24 PIGZ 80235 ENSG00000119227 Q86VD9
Hs.313 secreted phosphoprotein 1 (SPP1) 4q21-q25 3.12 SPP1 6696 ENSG00000118785 P10451
Hs.232618 secretogranin III (SCG3) 15q21 2.87 SCG3 29106 ENSG00000104112 Q8WXD2
Hs.47166 Chromosome 3 open reading frame 14 (C3orf14) 3p14.3 2.86 C3orf14 57415 ENSG00000114405 Q9HBI5
Hs.514527 baculoviral IAP repeat-containing 5 (BIRC5) 17q25 2.86
Hs.530003 solute carrier family 2 member 5 (SLC2A5) 1p36.2 2.84 SLC2A5 6518 ENSG00000142583 P22732
Hs.496068 PCTAIRE protein kinase 1 (PCTK1) xp11.3-p11.23 2.84 CDK16 5127 ENSG00000102225 Q00536
Hs.632642 phosphoglycerate mutase 2 (muscle) (PGAM2) 7p13-p12 2.79 PGAM2 5224 ENSG00000164708 P15259
Hs.502842 calpain 1, (mu/I) large subunit (CAPN1) 11q13 2.71 CAPN1 823 ENSG00000014216 P07384
Hs.655168 Sideroflexin 4 (SFXN4) 10 2.61 SFXN4 119559 ENSG00000183605 Q6P4A7
Hs.645248 phosphate cytidylyltransferase 2 (PCYT2) 17q25.3 2.58
Hs.465985 arsA (bacterial) arsenite transporter homolog 1 (ASNA1) 19q13.3 2.51 GET3 439 ENSG00000198356 O43681
Hs.527295 ectonucleotide pyrophosphatase/phosphodiesterase 1 (ENPP1) 6q22-q23 2.5 ENPP1 5167 ENSG00000197594 P22413
Hs.5148 TRAF-type zinc finger domain containing 1 (TRAFD1) 12q 2.49 TRAFD1 10906 ENSG00000135148 O14545
Hs.194754 Chromosome 1 open reading frame 107 (C1orf107) 1q32.3-q41 2.42 UTP25 27042 ENSG00000117597 Q68CQ4
Hs.90303 tuberous sclerosis 2 (TSC2) 16p13.3 2.4 TSC2 7249 ENSG00000103197 P49815
Hs.631655 Leprecan-like 2 (LEPREL2) 12q13 2.35 P3H3 10536 ENSG00000110811 Q8IVL6
Hs.274402 heat shock 70 kDa protein 1B (HSPA1B) 6p21.3 2.34 HSPA1B 3304 ENSG00000232804;ENSG00000204388;ENSG00000231555;ENSG00000212866;ENSG00000224501 P0DMV8;P0DMV9
Hs.643454 dystrobrevin, _ #DTNA#, transcript variant DTN2 (DTNA) 18q12 2.34 DTNA 1837 ENSG00000134769 Q9Y4J8
Hs.500375 ectonucleoside triphosphate diphosphohydrolase 6 (putative) (ENTPD6) 2.33 ENTPD6 955 ENSG00000197586 O75354
Hs.500916 internexin neuronal intermediate filament protein, alpha (INA) 10q25.1 2.32 INA 9118 ENSG00000148798 Q16352
Hs.654401 IMP (inosine monophosphate) dehydrogenase 1 (IMPDH1) 7q31.3-q32 2.32 IMPDH1 3614 ENSG00000106348 P20839
Hs.631992 Dystonin (DST) 2.31
Hs.135270 collapsin response mediator protein 1 (CRMP1) 4p16.1-p15 2.3 CRMP1 1400 ENSG00000072832 Q14194
Hs.524947 CDC20 cell division cycle 20S. cerevisiaehomolog (CDC20) 9q13-q21 2.28 CDC20 991 ENSG00000117399 Q12834
Hs.651212 Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, _ polypeptide (YWHAB) 2.27
Hs.500197 RaP2 interacting protein 8 (RPIP8) 17q21.31 2.26 RUNDC3A 10900 ENSG00000108309 Q59EK9
Hs.533772 Meteorin, glial cell differentiation regulator (METRN) 16 2.24 METRN 79006 ENSG00000103260 Q9UJH8
Hs.576875 DEAD/H box polypeptide 21 (DDX21) 10q21 2.22
Hs.435255 UBX domain-containing 1 (UBXD1) 19p13 2.22 UBXN6 80700 ENSG00000167671 Q9BZV1
Hs.464391 tubulin-specific chaperone d (TBCD) 17 2.21 TBCD 6904 ENSG00000278759;ENSG00000141556 Q9BTW9
Hs.479439 BH-protocadherin (brain-heart) (PCDH7) 4p15 2.19 PCDH7 5099 ENSG00000169851 O60245
Hs.591234 gap junction protein, _2 (GJB2) 13q11-q12 2.18
Hs.528572 vinexin _ (SH3-containing adaptor molecule-1) (SORBS3) 8p21.2 2.17 SORBS3 10174 ENSG00000120896 O60504
Hs.464336 procollagen-proline, 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase), _ polypeptide (P4HB) 17q25 2.16 P4HB 5034 ENSG00000185624 P07237
Hs.9194 UBA domain containing 1 (UBAC1) 9q34.3 2.11 UBAC1 10422 ENSG00000130560 Q9BSL1
Hs.518414 Leucine-rich repeat and calponin homology domain containing 3 (LRCH3) 2.1 LRCH3 84859 ENSG00000186001 Q96II8
Hs.520028 heat shock 70 kDa protein 1A (HSPA1A) 6p21.3 2.09
Hs.808 heterogeneous nuclear ribonucleoprotein F (HNRPF) 2.08 HNRNPF 3185 ENSG00000169813 P52597
Hs.23413 ATAD3B220 sequences 1p36.32 2.08 ATAD3A 55210 ENSG00000197785 Q9NVI7
Hs.355934 splicing factor proline/glutamine rich 1pter-p32.3 2.06 SFPQ 6421 ENSG00000116560 P23246
Unigene Protein (Gene) CHR Ratio
Hs.334370 Brain expressed, X-linked 1 (BEX1) Xq22 5.55
Hs.690634 Heat shock 70 kDa protein 1-like (HSPA1L) 5.43
Hs.631926 Cadherin, EGF LAG seven-pass G-type receptor 3, flamingoDrosophilahomolog (CELSR3) 3p24.1-p21.2 4.92
Hs.465506 Phosphatidic acid phosphatase type 2C (PPAP2C) 19p13 4.14
Hs.432648 heat shock 70 kDa protein 2 (HSPA2) 14q24.1 3.97
Hs.69089 galactosidase, _ (GLA) Xq22 3.85
Hs.433750 eukaryotic translation initiation factor 4 _, 1 (EIF4G1) 3q27-qter 3.80
Hs.533628 KIAA0133 3.75
Hs.279929 Transmembrane emp24 protein transport domain containing 9 (TMED9) 5q35.3 3.44
Hs.516874 chromogranin B (CHGB) 20pter-p12 3.42
Hs.74565 amyloid _ precursor-like protein 1 (APLP1) 19q13.1 3.40
Hs.650382 RAB5C, member RAS oncogene family (RAB5C) 17q21.2 3.28
Hs.23581 leptin receptor (LEPR) 1 3.25
Hs.518403 Phosphatidylinositol glycan anchor biosynthesis class Z (PIGZ) 3q29 3.24
Hs.313 secreted phosphoprotein 1 (SPP1) 4q21-q25 3.12
Hs.232618 secretogranin III (SCG3) 15q21 2.87
Hs.47166 Chromosome 3 open reading frame 14 (C3orf14) 3p14.3 2.86
Hs.514527 baculoviral IAP repeat-containing 5 (BIRC5) 17q25 2.86
Hs.530003 solute carrier family 2 member 5 (SLC2A5) 1p36.2 2.84
Hs.496068 PCTAIRE protein kinase 1 (PCTK1) xp11.3-p11.23 2.84
Hs.632642 phosphoglycerate mutase 2 (muscle) (PGAM2) 7p13-p12 2.79
Hs.502842 calpain 1, (mu/I) large subunit (CAPN1) 11q13 2.71
Hs.655168 Sideroflexin 4 (SFXN4) 10 2.61
Hs.645248 phosphate cytidylyltransferase 2 (PCYT2) 17q25.3 2.58
Hs.465985 arsA (bacterial) arsenite transporter homolog 1 (ASNA1) 19q13.3 2.51
Hs.527295 ectonucleotide pyrophosphatase/phosphodiesterase 1 (ENPP1) 6q22-q23 2.50
Hs.5148 TRAF-type zinc finger domain containing 1 (TRAFD1) 12q 2.49
Hs.194754 Chromosome 1 open reading frame 107 (C1orf107) 1q32.3-q41 2.42
Hs.90303 tuberous sclerosis 2 (TSC2) 16p13.3 2.40
Hs.631655 Leprecan-like 2 (LEPREL2) 12q13 2.35
Hs.274402 heat shock 70 kDa protein 1B (HSPA1B) 6p21.3 2.34
Hs.643454 dystrobrevin, _ #DTNA#, transcript variant DTN2 (DTNA) 18q12 2.34
Hs.500375 ectonucleoside triphosphate diphosphohydrolase 6 (putative) (ENTPD6) 2.33
Hs.500916 internexin neuronal intermediate filament protein, alpha (INA) 10q25.1 2.32
Hs.654401 IMP (inosine monophosphate) dehydrogenase 1 (IMPDH1) 7q31.3-q32 2.32
Hs.631992 Dystonin (DST) 2.31
Hs.135270 collapsin response mediator protein 1 (CRMP1) 4p16.1-p15 2.30
Hs.524947 CDC20 cell division cycle 20S. cerevisiaehomolog (CDC20) 9q13-q21 2.28
Hs.651212 Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, _ polypeptide (YWHAB) 2.27
Hs.500197 RaP2 interacting protein 8 (RPIP8) 17q21.31 2.26
Hs.533772 Meteorin, glial cell differentiation regulator (METRN) 16 2.24
Hs.576875 DEAD/H box polypeptide 21 (DDX21) 10q21 2.22
Hs.435255 UBX domain-containing 1 (UBXD1) 19p13 2.22
Hs.464391 tubulin-specific chaperone d (TBCD) 17 2.21
Hs.479439 BH-protocadherin (brain-heart) (PCDH7) 4p15 2.19
Hs.591234 gap junction protein, _2 (GJB2) 13q11-q12 2.18
Hs.528572 vinexin _ (SH3-containing adaptor molecule-1) (SORBS3) 8p21.2 2.17
Hs.464336 procollagen-proline, 2-oxoglutarate 4-dioxygenase(proline 4-hydroxylase), _ polypeptide (P4HB) 17q25 2.16
Hs.9194 UBA domain containing 1 (UBAC1) 9q34.3 2.11
Hs.518414 Leucine-rich repeat and calponin homology domain containing 3 (LRCH3) 2.10
Hs.520028 heat shock 70 kDa protein 1A (HSPA1A) 6p21.3 2.09
Hs.808 heterogeneous nuclear ribonucleoprotein F (HNRPF) 2.08
Hs.23413 ATAD3B220 sequences 1p36.32 2.08
Hs.355934 splicing factor proline/glutamine rich 1pter-p32.3 2.06
matplotlib==3.3.2
mygene==3.1.0
numpy==1.19.2
pandas==1.1.2
seaborn==0.11.0
sinfo==0.3.1
tqdm==4.50.0
IPython==7.18.1
jupyter_client==6.1.7
jupyter_core==4.6.3
notebook==6.1.4