Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save MikeDacre/7531d4e02052cbb8e906a084ad68e4e7 to your computer and use it in GitHub Desktop.
Save MikeDacre/7531d4e02052cbb8e906a084ad68e4e7 to your computer and use it in GitHub Desktop.
Use SQLAlchemy with the UCSC Genome Database
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"cell_type": "markdown",
"source": "# Use SQLAlchemy with the UCSC Database\n\n\nWe are going to use SQLAlchemy first to pull a list of tables, and then to create an ORM of the table we care about, in this case SNP147.\n\nTo do this we will be makign use of SQLAlchemy's [autmapping functionality](http://docs.sqlalchemy.org/en/latest/orm/extensions/automap.html), which creates ORM object directly from tables. Unfortunately, it fails with a number of USCS tables because they don't have detectable primary keys, which are required for the ORM. Because of this we will first inspect the database and dump a list of tables, and then explicity declare a class for the table we care about. SQLAlchemy will fill in all extra columns for us, so we only need to explicitly declare the class and primary key, everything else is done for us."
},
{
"metadata": {
"trusted": true,
"collapsed": true,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:39.244869",
"end_time": "2017-02-09T11:05:39.414740"
}
},
"cell_type": "code",
"source": "import pandas as pd\nfrom sqlalchemy import inspect\nfrom sqlalchemy import create_engine\nfrom sqlalchemy import Column, Integer, String\nfrom sqlalchemy.orm import Session\nfrom sqlalchemy.sql import select\nfrom sqlalchemy.ext.automap import automap_base",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:39.416023",
"end_time": "2017-02-09T11:05:39.446926"
}
},
"cell_type": "code",
"source": "# Connect to the hg19 database\nengine = create_engine(\"mysql+pymysql://[email protected]/{organism}?charset=utf8mb4\".format(organism='hg19'))",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Get table"
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:39.448257",
"end_time": "2017-02-09T11:05:43.089667"
}
},
"cell_type": "code",
"source": "# Get the list of tables\ninspector = inspect(engine)\ntables = inspector.get_table_names()",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This list is *very* long. You can just look through it, but we are going to filter it for tables that begin with 'snp'"
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:43.090849",
"end_time": "2017-02-09T11:05:43.106050"
}
},
"cell_type": "code",
"source": "len(tables)",
"execution_count": 4,
"outputs": [
{
"metadata": {},
"execution_count": 4,
"output_type": "execute_result",
"data": {
"text/plain": "11048"
}
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": true,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:43.108442",
"end_time": "2017-02-09T11:05:43.130086"
}
},
"cell_type": "code",
"source": "snp_tables = [i for i in tables if i.startswith('snp')]",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"scrolled": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:43.131714",
"end_time": "2017-02-09T11:05:43.147156"
},
"trusted": true
},
"cell_type": "code",
"source": "sorted(snp_tables)",
"execution_count": 6,
"outputs": [
{
"metadata": {},
"execution_count": 6,
"output_type": "execute_result",
"data": {
"text/plain": "['snp138',\n 'snp138CodingDbSnp',\n 'snp138Common',\n 'snp138ExceptionDesc',\n 'snp138Flagged',\n 'snp138Mult',\n 'snp138OrthoPt4Pa2Rm3',\n 'snp138Seq',\n 'snp141',\n 'snp141CodingDbSnp',\n 'snp141Common',\n 'snp141ExceptionDesc',\n 'snp141Flagged',\n 'snp141OrthoPt4Pa2Rm3',\n 'snp141Seq',\n 'snp142',\n 'snp142CodingDbSnp',\n 'snp142Common',\n 'snp142ExceptionDesc',\n 'snp142Flagged',\n 'snp142Mult',\n 'snp142OrthoPt4Pa2Rm3',\n 'snp142Seq',\n 'snp144',\n 'snp144CodingDbSnp',\n 'snp144Common',\n 'snp144ExceptionDesc',\n 'snp144Flagged',\n 'snp144Mult',\n 'snp144OrthoPt4Pa2Rm3',\n 'snp144Seq',\n 'snp146',\n 'snp146CodingDbSnp',\n 'snp146Common',\n 'snp146ExceptionDesc',\n 'snp146Flagged',\n 'snp146Mult',\n 'snp146OrthoPt4Pa2Rm3',\n 'snp146Seq',\n 'snp147',\n 'snp147CodingDbSnp',\n 'snp147Common',\n 'snp147ExceptionDesc',\n 'snp147Flagged',\n 'snp147Mult',\n 'snp147OrthoPt4Pa2Rm3',\n 'snp147Seq',\n 'snpArrayAffy250Nsp',\n 'snpArrayAffy250Sty',\n 'snpArrayAffy5',\n 'snpArrayAffy6',\n 'snpArrayAffy6SV',\n 'snpArrayIllumina1M',\n 'snpArrayIllumina1MRaw',\n 'snpArrayIllumina300',\n 'snpArrayIllumina550',\n 'snpArrayIllumina650',\n 'snpArrayIlluminaHuman660W_Quad',\n 'snpArrayIlluminaHuman660W_QuadRaw',\n 'snpArrayIlluminaHumanCytoSNP_12',\n 'snpArrayIlluminaHumanCytoSNP_12Raw',\n 'snpArrayIlluminaHumanOmni1_Quad',\n 'snpArrayIlluminaHumanOmni1_QuadRaw']"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "OK, so let's use snp147."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Get Columns and Build ORM\n\nWe are now going to make the ORM, so let's check the actual columns of the table."
},
{
"metadata": {
"scrolled": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:43.148524",
"end_time": "2017-02-09T11:05:43.193024"
},
"trusted": true
},
"cell_type": "code",
"source": "inspector.get_columns('snp147')",
"execution_count": 7,
"outputs": [
{
"metadata": {},
"execution_count": 7,
"output_type": "execute_result",
"data": {
"text/plain": "[{'autoincrement': False,\n 'default': None,\n 'name': 'bin',\n 'nullable': False,\n 'type': SMALLINT(display_width=5, unsigned=True)},\n {'default': None,\n 'name': 'chrom',\n 'nullable': False,\n 'type': VARCHAR(length=31)},\n {'autoincrement': False,\n 'default': None,\n 'name': 'chromStart',\n 'nullable': False,\n 'type': INTEGER(display_width=10, unsigned=True)},\n {'autoincrement': False,\n 'default': None,\n 'name': 'chromEnd',\n 'nullable': False,\n 'type': INTEGER(display_width=10, unsigned=True)},\n {'default': None,\n 'name': 'name',\n 'nullable': False,\n 'type': VARCHAR(length=15)},\n {'autoincrement': False,\n 'default': None,\n 'name': 'score',\n 'nullable': False,\n 'type': SMALLINT(display_width=5, unsigned=True)},\n {'default': None,\n 'name': 'strand',\n 'nullable': False,\n 'type': ENUM('+', '-')},\n {'default': None, 'name': 'refNCBI', 'nullable': False, 'type': BLOB()},\n {'default': None, 'name': 'refUCSC', 'nullable': False, 'type': BLOB()},\n {'default': None,\n 'name': 'observed',\n 'nullable': False,\n 'type': VARCHAR(length=255)},\n {'default': None,\n 'name': 'molType',\n 'nullable': False,\n 'type': ENUM('unknown', 'genomic', 'cDNA')},\n {'default': None,\n 'name': 'class',\n 'nullable': False,\n 'type': ENUM('single', 'in-del', 'microsatellite', 'named', 'mnp', 'insertion', 'deletion')},\n {'default': None, 'name': 'valid', 'nullable': False, 'type': SET(length=15)},\n {'default': None, 'name': 'avHet', 'nullable': False, 'type': FLOAT()},\n {'default': None, 'name': 'avHetSE', 'nullable': False, 'type': FLOAT()},\n {'default': None, 'name': 'func', 'nullable': False, 'type': SET(length=14)},\n {'default': None,\n 'name': 'locType',\n 'nullable': False,\n 'type': ENUM('range', 'exact', 'between', 'rangeInsertion', 'rangeSubstitution', 'rangeDeletion', 'fuzzy')},\n {'autoincrement': False,\n 'default': None,\n 'name': 'weight',\n 'nullable': False,\n 'type': INTEGER(display_width=10, unsigned=True)},\n {'default': None,\n 'name': 'exceptions',\n 'nullable': False,\n 'type': SET(length=26)},\n {'autoincrement': False,\n 'default': None,\n 'name': 'submitterCount',\n 'nullable': False,\n 'type': SMALLINT(display_width=5, unsigned=True)},\n {'default': None,\n 'name': 'submitters',\n 'nullable': False,\n 'type': LONGBLOB()},\n {'autoincrement': False,\n 'default': None,\n 'name': 'alleleFreqCount',\n 'nullable': False,\n 'type': SMALLINT(display_width=5, unsigned=True)},\n {'default': None, 'name': 'alleles', 'nullable': False, 'type': LONGBLOB()},\n {'default': None, 'name': 'alleleNs', 'nullable': False, 'type': LONGBLOB()},\n {'default': None,\n 'name': 'alleleFreqs',\n 'nullable': False,\n 'type': LONGBLOB()},\n {'default': None,\n 'name': 'bitfields',\n 'nullable': False,\n 'type': SET(length=33)}]"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**There are a lot of columns, but the vast majority will automap just fine, so we don't need to do anything. However, we do need to add a primary key as none exists here.**\n\nI actually just want to look up the SNP name by the position, so I am going to explicitly declare a primary key (since I don't care about their primary key and **some UCSC tables have primary keys that are not dectected by the automapper**) and also declare the columns I care about. The table will end up with all columns, but I want to guarantee access to the columns that I care about."
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:05:43.194288",
"end_time": "2017-02-09T11:07:02.806559"
}
},
"cell_type": "code",
"source": "# The automap base will detect all tables in the database and create classes for\n# as many as it can. Many UCSC tables don't become classes because the primary key\n# cannot be detected.\nBase = automap_base()\n\nclass SNP147(Base):\n __tablename__ = 'snp147'\n \n name = Column(String(length=15), primary_key=True, nullable=False)\n \n # The following columns do not need to be declared, the automapper will do it for\n # us. I map them anyway for my own personal reference.\n chrom = Column(String(length=31), nullable=False)\n chromStart = Column(Integer, nullable=False)\n chromEnd = Column(Integer, nullable=False)\n\n# reflect the tables\nBase.prepare(engine, reflect=True)",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true,
"ExecuteTime": {
"start_time": "2017-02-09T11:07:02.808131",
"end_time": "2017-02-09T11:07:02.821514"
}
},
"cell_type": "code",
"source": "session = Session(engine)",
"execution_count": 9,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Query the table"
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:07:02.822862",
"end_time": "2017-02-09T11:07:06.590669"
}
},
"cell_type": "code",
"source": "session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd == 16952481).first()",
"execution_count": 10,
"outputs": [
{
"metadata": {},
"execution_count": 10,
"output_type": "execute_result",
"data": {
"text/plain": "('rs915311')"
}
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": true,
"ExecuteTime": {
"start_time": "2017-02-09T11:07:06.592264",
"end_time": "2017-02-09T11:07:06.606709"
}
},
"cell_type": "code",
"source": "positions = [\n 154326279,\n 11029552,\n 241803636,\n 59165838,\n 39991588,\n 204733046,\n 16341354,\n 16971948,\n 154056834,\n 9712006\n]",
"execution_count": 11,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:07:06.608088",
"end_time": "2017-02-09T11:07:06.622075"
}
},
"cell_type": "code",
"source": "q = session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions))",
"execution_count": 12,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:07:06.623564",
"end_time": "2017-02-09T11:07:52.822993"
}
},
"cell_type": "code",
"source": "q.all()",
"execution_count": 13,
"outputs": [
{
"metadata": {},
"execution_count": 13,
"output_type": "execute_result",
"data": {
"text/plain": "[('rs78800005'),\n ('rs74052099'),\n ('rs1048245'),\n ('rs2209174'),\n ('rs3738676'),\n ('rs11207297'),\n ('rs72694300'),\n ('rs4040747'),\n ('rs12034642'),\n ('rs7513575')]"
}
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:07:52.826283",
"end_time": "2017-02-09T11:08:39.215029"
}
},
"cell_type": "code",
"source": "df = pd.read_sql_query(q.statement, engine)",
"execution_count": 14,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:08:39.216478",
"end_time": "2017-02-09T11:08:39.234590"
}
},
"cell_type": "code",
"source": "df",
"execution_count": 15,
"outputs": [
{
"metadata": {},
"execution_count": 15,
"output_type": "execute_result",
"data": {
"text/html": "<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>name</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>rs78800005</td>\n </tr>\n <tr>\n <th>1</th>\n <td>rs74052099</td>\n </tr>\n <tr>\n <th>2</th>\n <td>rs1048245</td>\n </tr>\n <tr>\n <th>3</th>\n <td>rs2209174</td>\n </tr>\n <tr>\n <th>4</th>\n <td>rs3738676</td>\n </tr>\n <tr>\n <th>5</th>\n <td>rs11207297</td>\n </tr>\n <tr>\n <th>6</th>\n <td>rs72694300</td>\n </tr>\n <tr>\n <th>7</th>\n <td>rs4040747</td>\n </tr>\n <tr>\n <th>8</th>\n <td>rs12034642</td>\n </tr>\n <tr>\n <th>9</th>\n <td>rs7513575</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " name\n0 rs78800005\n1 rs74052099\n2 rs1048245\n3 rs2209174\n4 rs3738676\n5 rs11207297\n6 rs72694300\n7 rs4040747\n8 rs12034642\n9 rs7513575"
}
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": true,
"ExecuteTime": {
"start_time": "2017-02-09T11:08:39.236182",
"end_time": "2017-02-09T11:09:25.690168"
}
},
"cell_type": "code",
"source": "df2 = pd.read_sql_query(\n session.query(SNP147.name, SNP147.chrom, SNP147.chromEnd).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions)).statement,\n engine\n)",
"execution_count": 16,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false,
"ExecuteTime": {
"start_time": "2017-02-09T11:09:25.692398",
"end_time": "2017-02-09T11:09:25.723277"
}
},
"cell_type": "code",
"source": "df2",
"execution_count": 17,
"outputs": [
{
"metadata": {},
"execution_count": 17,
"output_type": "execute_result",
"data": {
"text/html": "<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>name</th>\n <th>chrom</th>\n <th>chromEnd</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>rs78800005</td>\n <td>chr1</td>\n <td>9712006</td>\n </tr>\n <tr>\n <th>1</th>\n <td>rs74052099</td>\n <td>chr1</td>\n <td>11029552</td>\n </tr>\n <tr>\n <th>2</th>\n <td>rs1048245</td>\n <td>chr1</td>\n <td>16341354</td>\n </tr>\n <tr>\n <th>3</th>\n <td>rs2209174</td>\n <td>chr1</td>\n <td>16971948</td>\n </tr>\n <tr>\n <th>4</th>\n <td>rs3738676</td>\n <td>chr1</td>\n <td>39991588</td>\n </tr>\n <tr>\n <th>5</th>\n <td>rs11207297</td>\n <td>chr1</td>\n <td>59165838</td>\n </tr>\n <tr>\n <th>6</th>\n <td>rs72694300</td>\n <td>chr1</td>\n <td>154056834</td>\n </tr>\n <tr>\n <th>7</th>\n <td>rs4040747</td>\n <td>chr1</td>\n <td>154326279</td>\n </tr>\n <tr>\n <th>8</th>\n <td>rs12034642</td>\n <td>chr1</td>\n <td>204733046</td>\n </tr>\n <tr>\n <th>9</th>\n <td>rs7513575</td>\n <td>chr1</td>\n <td>241803636</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " name chrom chromEnd\n0 rs78800005 chr1 9712006\n1 rs74052099 chr1 11029552\n2 rs1048245 chr1 16341354\n3 rs2209174 chr1 16971948\n4 rs3738676 chr1 39991588\n5 rs11207297 chr1 59165838\n6 rs72694300 chr1 154056834\n7 rs4040747 chr1 154326279\n8 rs12034642 chr1 204733046\n9 rs7513575 chr1 241803636"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "That's it, simple! Yes, you can just query the MySQL directly, but this approach is much more robust for on-the-fly data analysis. It is easy to get SQL syntax wrong, but using SQLAlchemy in Jupyter is trivial."
}
],
"metadata": {
"toc": {
"threshold": 4,
"number_sections": true,
"toc_cell": false,
"toc_window_display": false,
"toc_section_display": "block",
"sideBar": true,
"navigate_menu": true,
"nav_menu": {
"width": "252px",
"height": "102px"
}
},
"anaconda-cloud": {},
"language_info": {
"name": "python",
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py",
"version": "3.5.2",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"mimetype": "text/x-python"
},
"kernelspec": {
"name": "conda-root-py",
"display_name": "Python [conda root]",
"language": "python"
},
"gist": {
"id": "",
"data": {
"description": "Use SQLAlchemy with the UCSC Genome Database",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment