Skip to content

Instantly share code, notes, and snippets.

@egonw
Created May 4, 2009 07:54
Show Gist options
  • Select an option

  • Save egonw/106371 to your computer and use it in GitHub Desktop.

Select an option

Save egonw/106371 to your computer and use it in GitHub Desktop.
1. Get a copy of PubChem
$ mkdir pubchem
$ cd pubchem
$ wget -r ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/XML/ -c
$ ln -s ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/XML XML
2. Set up MySQL
CREATE TABLE IF NOT EXISTS `compounds` (
`InChI` varchar(4096) NOT NULL,
`CID` int(11) NOT NULL,
`OrganicSubset` tinyint(1) NOT NULL,
PRIMARY KEY (`CID`),
KEY `InChI` (`InChI`(1000))
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
3. Set up Groovy, CDK and MySQL
$ sudo aptitude install groovy
$ export CLASSPATH=cdk-1.2.1.jar:mysql.jar
and populate the database with structures:
$ groovy populate.groovy
You may want to tune the file matching algorithm to populate with the first 1M or 10M, by tuning the regular expression do filter all files in the XML folder:
dir = new File("XML")
def p = ~/Compound_0.*gz/
For the first 10M structures or ~/Compound_00.*gz/ for the first million, instead of ~/Compound.*gz/ for all SD files.
4. Atom Type perception
Extra MySQL table:
CREATE TABLE IF NOT EXISTS `atomtypeproblem` (
`atom` int(11) NOT NULL,
`element` varchar(2) NOT NULL,
`cid` int(11) NOT NULL,
KEY `cid` (`cid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Run the analysis:
$ groovy atomtyping.groovy
Again, you may want to tune the file name pattern to match only the first 10M or so.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment