Created
October 1, 2021 19:19
Convert ENSEMBL stable identifiers to gene symbols
import biomart


def get_ensembl_mappings():
    # Set up connection to server
    server = biomart.BiomartServer('http://uswest.ensembl.org/biomart')
    mart = server.datasets['mmusculus_gene_ensembl']

    # List the types of data we want
    attributes = ['ensembl_transcript_id', 'mgi_symbol',
                  'ensembl_gene_id', 'ensembl_peptide_id']

    # Get the mapping between the attributes
    response = mart.search({'attributes': attributes})
    data = response.raw.data.decode('ascii')

    ensembl_to_genesymbol = {}
    # Store the data in a dict
    for line in data.splitlines():
        line = line.split('\t')
        # The entries are in the same order as in the `attributes` variable
        transcript_id = line[0]
        gene_symbol = line[1]
        ensembl_gene = line[2]
        ensembl_peptide = line[3]

        # Some of these keys may be an empty string. If you want, you can
        # avoid having a '' key in your dict by ensuring the
        # transcript/gene/peptide ids have a nonzero length before
        # adding them to the dict
        ensembl_to_genesymbol[transcript_id] = gene_symbol
        ensembl_to_genesymbol[ensembl_gene] = gene_symbol
        ensembl_to_genesymbol[ensembl_peptide] = gene_symbol

    return ensembl_to_genesymbol
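The comment inside the loop mentions skipping empty ids. Here is a minimal sketch of that check, assuming the same tab-separated column order as the attributes list. Note that parse_mappings is a hypothetical helper (it is not part of the gist or of the biomart package), and it works on already-downloaded text so it can be tried without a server connection. The example rows are made up.

```python
def parse_mappings(data):
    """Build an id -> gene symbol dict from tab-separated BioMart text.

    Hypothetical helper: assumes columns in the order
    transcript id, gene symbol, gene id, peptide id.
    """
    ensembl_to_genesymbol = {}
    for line in data.splitlines():
        fields = line.split('\t')
        if len(fields) < 4:
            # Skip blank or malformed lines
            continue
        transcript_id, gene_symbol, ensembl_gene, ensembl_peptide = fields[:4]
        for identifier in (transcript_id, ensembl_gene, ensembl_peptide):
            # Only keep ids with a nonzero length, so '' never becomes a key
            if identifier:
                ensembl_to_genesymbol[identifier] = gene_symbol
    return ensembl_to_genesymbol


# Made-up rows for illustration only (not real query output);
# the second row has an empty peptide id
example = (
    "ENSMUST0001\tGeneA\tENSMUSG0001\tENSMUSP0001\n"
    "ENSMUST0002\tGeneA\tENSMUSG0001\t\n"
)
mapping = parse_mappings(example)
```

In the original function, the same effect could be had by passing the decoded response text to a helper like this instead of looping inline.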
Thanks a lot!
It works
Vic
Glad to hear it!
Hi Ben,
I need to convert mouse Entrez IDs into mouse gene symbols. Could I use this function with different parameters?
Best
Vic
I haven't tried it, but it should be possible! The most straightforward way to do so would be to add 'entrezgene_id' to the end of the attributes list and convert the ensembl_to_genesymbol lines to map entrez to genesymbol, e.g.
entrez_id = line[4]
entrez_to_genesymbol[entrez_id] = gene_symbol
More information on the available attributes can be found here: https://bioconductor.riken.jp/packages/3.4/bioc/vignettes/biomaRt/inst/doc/biomaRt.html
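The suggestion above can be sketched as follows. This is untested against a live server and assumes 'entrezgene_id' has been appended as the fifth attribute, so it shows up as the fifth tab-separated column. The helper name parse_entrez_mappings and the example rows are hypothetical.

```python
def parse_entrez_mappings(data):
    """Build an Entrez id -> gene symbol dict from tab-separated text.

    Hypothetical helper: assumes columns in the order transcript id,
    gene symbol, gene id, peptide id, Entrez id.
    """
    entrez_to_genesymbol = {}
    for line in data.splitlines():
        fields = line.split('\t')
        if len(fields) < 5:
            # Skip blank or malformed lines
            continue
        gene_symbol = fields[1]
        entrez_id = fields[4]
        # Skip rows where either the Entrez id or the symbol is missing
        if entrez_id and gene_symbol:
            entrez_to_genesymbol[entrez_id] = gene_symbol
    return entrez_to_genesymbol


# Made-up rows for illustration only; the second row has no Entrez id
example = (
    "ENSMUST0001\tGeneA\tENSMUSG0001\tENSMUSP0001\t12345\n"
    "ENSMUST0002\tGeneB\tENSMUSG0002\t\t\n"
)
entrez_mapping = parse_entrez_mappings(example)
```

The keys here are strings, so if your Entrez IDs are stored as integers you would need to convert one side before looking them up.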
Great! I will try.
Best
Hi Victor, since get_ensembl_mappings returns a dict, I think the correct pandas function to use would be map. Without having looked at your data, I think df['gene_id'] = df['gene_id'].map(get_ensembl_mappings()) should work (just be careful to handle the NaN values for the ids that don't map).
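A small sketch of the NaN handling mentioned above, using a made-up dict in place of the real get_ensembl_mappings() output and a made-up DataFrame. Falling back to the original id is only one possible choice; dropping the unmapped rows would be another.

```python
import pandas as pd

# Stand-in for the dict returned by get_ensembl_mappings(); made-up values
mapping = {'ENSMUSG0001': 'GeneA', 'ENSMUSG0002': 'GeneB'}

df = pd.DataFrame({'gene_id': ['ENSMUSG0001', 'ENSMUSG0002', 'ENSMUSG9999']})

# Series.map with a dict gives NaN for ids that are not keys in the dict
symbols = df['gene_id'].map(mapping)

# Boolean mask of the ids that failed to map, useful for reporting
unmapped = symbols.isna()

# Keep the original id wherever no symbol was found
df['gene_symbol'] = symbols.fillna(df['gene_id'])
```

Writing the result to a new column rather than overwriting gene_id keeps the original ids available for checking which rows did not map.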