Skip to content

Instantly share code, notes, and snippets.

@ravila4
Forked from newgene/id_mapping_mygene.ipynb
Created November 16, 2020 16:04
Show Gist options
  • Save ravila4/346e040eb758f6e1e6579d325274b2ae to your computer and use it in GitHub Desktop.
Save ravila4/346e040eb758f6e1e6579d325274b2ae to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "ID mapping using MyGene.info "
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "ID mapping using mygene module in Python"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "ID mapping is a very common, often not fun, task for every bioinformatician. Supposedly you have a list of gene symbols or reporter ids from an upstream analysis, and then your next analysis requires to use gene ids (e.g. Entrez gene ids or Ensembl gene ids). So you want to convert that list of gene symbols or reporter ids to corresponding gene ids.\n\nHere we want to show you how to use **mygene** module in Python to do ID mapping quickly and easily. **mygene** is essentially a convenient Python module to access [MyGene.info](http://mygene.info) gene query web services."
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Install **mygene**"
},
{
"cell_type": "markdown",
"metadata": {},
"source": " Install **mygene** is easy, as *pip* is your friend:\n\n pip install mygene\n\n Now you just need to import it and instantiate MyGeneInfo class:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import mygene\n\nmg = mygene.MyGeneInfo()\n",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Mapping gene symbols to Entrez gene ids"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Suppose **xli** is a list of gene symbols you want to convert to entrez gene ids:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "xli = ['DDX26B',\n 'CCDC83',\n 'MAST3',\n 'FLOT1',\n 'RPL11',\n 'ZDHHC20',\n 'LUC7L3',\n 'SNORD49A',\n 'CTSH',\n 'ACOT8']",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": " you can then call **querymany** method, telling it your input is \"symbol\", and you want \"entrezgene\" (Entrez gene ids) back."
},
{
"cell_type": "code",
"collapsed": false,
"input": "out = mg.querymany(xli, scopes='symbol', fields='entrezgene', species='human')",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Finished.\n"
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Returned \"**out**\" looks like this:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "out",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": "[{u'_id': u'203522', u'entrezgene': 203522, u'query': u'DDX26B'},\n {u'_id': u'220047', u'entrezgene': 220047, u'query': u'CCDC83'},\n {u'_id': u'23031', u'entrezgene': 23031, u'query': u'MAST3'},\n {u'_id': u'10211', u'entrezgene': 10211, u'query': u'FLOT1'},\n {u'_id': u'6135', u'entrezgene': 6135, u'query': u'RPL11'},\n {u'_id': u'253832', u'entrezgene': 253832, u'query': u'ZDHHC20'},\n {u'_id': u'51747', u'entrezgene': 51747, u'query': u'LUC7L3'},\n {u'_id': u'26800', u'entrezgene': 26800, u'query': u'SNORD49A'},\n {u'_id': u'1512', u'entrezgene': 1512, u'query': u'CTSH'},\n {u'_id': u'10005', u'entrezgene': 10005, u'query': u'ACOT8'}]"
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The mapping result is returned as a list of dictionaries. Each dictionary contains the **fields** you asked to return, in this case, \"entrezgene\" field. Each dictionary also returns the matching query term, \"**query**\", and an internal id, \"**_id**\", which is the same as \"**entrezgene**\" most of time (will be an ensembl gene id if a gene is available from Ensembl only)."
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Mapping gene symbols to Ensembl gene ids"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now if you want Ensembl gene ids back:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "mg.querymany(xli, scopes='symbol', fields='ensembl.gene', species='human')",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Finished.\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": "[{u'_id': u'203522',\n u'ensembl.gene': [u'ENSG00000165359', u'ENSG00000268630'],\n u'query': u'DDX26B'},\n {u'_id': u'220047', u'ensembl.gene': u'ENSG00000150676', u'query': u'CCDC83'},\n {u'_id': u'23031', u'ensembl.gene': u'ENSG00000099308', u'query': u'MAST3'},\n {u'_id': u'10211',\n u'ensembl.gene': [u'ENSG00000137312',\n u'ENSG00000206379',\n u'ENSG00000206480',\n u'ENSG00000223654',\n u'ENSG00000224740',\n u'ENSG00000230143',\n u'ENSG00000232280',\n u'ENSG00000236271'],\n u'query': u'FLOT1'},\n {u'_id': u'6135', u'ensembl.gene': u'ENSG00000142676', u'query': u'RPL11'},\n {u'_id': u'253832',\n u'ensembl.gene': u'ENSG00000180776',\n u'query': u'ZDHHC20'},\n {u'_id': u'51747', u'ensembl.gene': u'ENSG00000108848', u'query': u'LUC7L3'},\n {u'_id': u'26800',\n u'ensembl.gene': [u'ENSG00000206956', u'ENSG00000175061'],\n u'query': u'SNORD49A'},\n {u'_id': u'1512', u'ensembl.gene': u'ENSG00000103811', u'query': u'CTSH'},\n {u'_id': u'10005', u'ensembl.gene': u'ENSG00000101473', u'query': u'ACOT8'}]"
}
],
"prompt_number": 5
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "When an input id has no matching gene"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In case that an input id has no matching gene, you will be notified from the output.The returned dictionary for this query term contains \"notfound\" value as *True*."
},
{
"cell_type": "code",
"collapsed": false,
"input": "xli = ['DDX26B',\n 'CCDC83',\n 'MAST3',\n 'FLOT1',\n 'RPL11',\n 'Gm10494']",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": "mg.querymany(xli, scopes='symbol', fields='entrezgene', species='human')",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Finished.\n1 input query terms found no hit:\n\t[u'Gm10494']\nPass \"returnall=True\" to return complete lists of duplicate or missing query terms.\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": "[{u'_id': u'203522', u'entrezgene': 203522, u'query': u'DDX26B'},\n {u'_id': u'220047', u'entrezgene': 220047, u'query': u'CCDC83'},\n {u'_id': u'23031', u'entrezgene': 23031, u'query': u'MAST3'},\n {u'_id': u'10211', u'entrezgene': 10211, u'query': u'FLOT1'},\n {u'_id': u'6135', u'entrezgene': 6135, u'query': u'RPL11'},\n {u'notfound': True, u'query': u'Gm10494'}]"
}
],
"prompt_number": 7
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "When input ids are not just symbols"
},
{
"cell_type": "code",
"collapsed": false,
"input": "xli = ['DDX26B',\n 'CCDC83',\n 'MAST3',\n 'FLOT1',\n 'RPL11',\n 'Gm10494',\n '1007_s_at',\n 'AK125780']",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Above id list contains symbols, reporters and accession numbers, and supposedly we want to get back both Entrez gene ids and uniprot ids. Parameters **scopes**, **fields**, **species** are all flexible enough to support multiple values, either a list or a comma-separated string:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "mg.querymany(xli, scopes='symbol,reporter,accession', fields='entrezgene,uniprot', species='human')",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Finished.\n1 input query terms found dup hits:\n\t[(u'1007_s_at', 2)]\n1 input query terms found no hit:\n\t[u'Gm10494']\nPass \"returnall=True\" to return complete lists of duplicate or missing query terms.\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": "[{u'_id': u'203522',\n u'entrezgene': 203522,\n u'query': u'DDX26B',\n u'uniprot': {u'Swiss-Prot': u'Q5JSJ4'}},\n {u'_id': u'220047',\n u'entrezgene': 220047,\n u'query': u'CCDC83',\n u'uniprot': {u'Swiss-Prot': u'Q8IWF9'}},\n {u'_id': u'23031',\n u'entrezgene': 23031,\n u'query': u'MAST3',\n u'uniprot': {u'Swiss-Prot': u'O60307'}},\n {u'_id': u'10211',\n u'entrezgene': 10211,\n u'query': u'FLOT1',\n u'uniprot': {u'Swiss-Prot': u'O75955', u'TrEMBL': u'Q5ST80'}},\n {u'_id': u'6135',\n u'entrezgene': 6135,\n u'query': u'RPL11',\n u'uniprot': {u'Swiss-Prot': u'P62913', u'TrEMBL': u'Q5VVD0'}},\n {u'notfound': True, u'query': u'Gm10494'},\n {u'_id': u'100616237', u'entrezgene': 100616237, u'query': u'1007_s_at'},\n {u'_id': u'780',\n u'entrezgene': 780,\n u'query': u'1007_s_at',\n u'uniprot': {u'Swiss-Prot': u'Q08345', u'TrEMBL': [u'Q96T61', u'Q96T62']}},\n {u'_id': u'2978',\n u'entrezgene': 2978,\n u'query': u'AK125780',\n u'uniprot': {u'Swiss-Prot': u'P43080'}}]"
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "When a input id has multiple matching genes"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "From the previous result, you may have noticed that query term \"1007_s_at\" matches two genes. In that case, you will be notified from the output, and the returned result will include both matching genes.\n\nBy passing \"returnall=True\", you will get both duplicate or missing query terms, together with the mapping output, from the returned result:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "mg.querymany(xli, scopes='symbol,reporter,accession', fields='entrezgene,uniprot', species='human', returnall=True)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Finished.\n1 input query terms found dup hits:\n\t[(u'1007_s_at', 2)]\n1 input query terms found no hit:\n\t[u'Gm10494']\n"
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": "{'dup': [(u'1007_s_at', 2)],\n 'missing': [u'Gm10494'],\n 'out': [{u'_id': u'203522',\n u'entrezgene': 203522,\n u'query': u'DDX26B',\n u'uniprot': {u'Swiss-Prot': u'Q5JSJ4'}},\n {u'_id': u'220047',\n u'entrezgene': 220047,\n u'query': u'CCDC83',\n u'uniprot': {u'Swiss-Prot': u'Q8IWF9'}},\n {u'_id': u'23031',\n u'entrezgene': 23031,\n u'query': u'MAST3',\n u'uniprot': {u'Swiss-Prot': u'O60307'}},\n {u'_id': u'10211',\n u'entrezgene': 10211,\n u'query': u'FLOT1',\n u'uniprot': {u'Swiss-Prot': u'O75955', u'TrEMBL': u'Q5ST80'}},\n {u'_id': u'6135',\n u'entrezgene': 6135,\n u'query': u'RPL11',\n u'uniprot': {u'Swiss-Prot': u'P62913', u'TrEMBL': u'Q5VVD0'}},\n {u'notfound': True, u'query': u'Gm10494'},\n {u'_id': u'100616237', u'entrezgene': 100616237, u'query': u'1007_s_at'},\n {u'_id': u'780',\n u'entrezgene': 780,\n u'query': u'1007_s_at',\n u'uniprot': {u'Swiss-Prot': u'Q08345', u'TrEMBL': [u'Q96T61', u'Q96T62']}},\n {u'_id': u'2978',\n u'entrezgene': 2978,\n u'query': u'AK125780',\n u'uniprot': {u'Swiss-Prot': u'P43080'}}]}"
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The returned result above contains \"**out**\" for mapping output, \"**missing**\" for missing query terms (a list), and \"**dup**\" for query terms with multiple matches (including the number of matches)."
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Can I convert a very large list of ids?"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Yes, you can. If you pass an id list (i.e., **xli** above) larger than 1000 ids, we will do the id mapping in-batch with 1000 ids at a time, and then concatenate the results all together for you. So, from the user-end, it's exactly the same as passing a shorter list. You don't need to worry about saturating our backend servers."
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "To read more"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "* [MyGene.info](http://mygene.info)\n * [Batch queries via POST](http://mygene.info/doc/query_service.html#batch-queries-via-post)\n * [**scopes** parameter](http://mygene.info/doc/query_service.html#scopes)\n * [**fields** parameter](http://mygene.info/doc/query_service.html#fields)\n * [**species** parameter](http://mygene.info/doc/query_service.html#species)\n* [mygene module](https://crate.io/packages/mygene/)"
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment