Created
May 4, 2009 07:54
-
-
Save egonw/106371 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| 1. Get a copy of PubChem | |
| $ mkdir pubchem | |
| $ cd pubchem | |
| $ wget -r ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/XML/ -c | |
| $ ln -s ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/XML XML | |
| 2. Set up MySQL | |
| CREATE TABLE IF NOT EXISTS `compounds` ( | |
| `InChI` varchar(4096) NOT NULL, | |
| `CID` int(11) NOT NULL, | |
| `OrganicSubset` tinyint(1) NOT NULL, | |
| PRIMARY KEY (`CID`), | |
| KEY `InChI` (`InChI`(1000)) | |
| ) ENGINE=MyISAM DEFAULT CHARSET=latin1; | |
| 3. Set up Groovy, CDK and MySQL | |
| $ sudo aptitude install groovy | |
| $ export CLASSPATH=cdk-1.2.1.jar:mysql.jar | |
| and populate the database with structures: | |
| $ groovy populate.groovy | |
| You may want to tune the file matching algorithm to populate with the first 1M or 10M, by tuning the regular expression do filter all files in the XML folder: | |
| dir = new File("XML") | |
| def p = ~/Compound_0.*gz/ | |
| For the first 10M structures or ~/Compound_00.*gz/ for the first million, instead of ~/Compound.*gz/ for all SD files. | |
| 4. Atom Type perception | |
| Extra MySQL table: | |
| CREATE TABLE IF NOT EXISTS `atomtypeproblem` ( | |
| `atom` int(11) NOT NULL, | |
| `element` varchar(2) NOT NULL, | |
| `cid` int(11) NOT NULL, | |
| KEY `cid` (`cid`) | |
| ) ENGINE=MyISAM DEFAULT CHARSET=latin1; | |
| Run the analysis: | |
| $ groovy atomtyping.groovy | |
| Again, you may want to tune the file name pattern to match only the first 10M or so. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment