Notes on implementation language environments, execution performance and machine considerations; based on the software technical supplement to "Uncovering disease-disease relationships through the incomplete interactome" Science Magazine Feb 2015
19Apr15 dbb v0.2
Observations
-
supplied python code ran quickly on all compute environments (~2 seconds)
-
non-trivial transform of a python networkx interactome requires 9GB of physical RAM
-
neo4j interactive environment is mature and visually pleasing
-
neo4j queries use a single core only
-
neo4j visualization breaks down at a few hundred nodes, even on the Xeon E5
Three compute environments (described below) were used for various incarnations of the problem: first as supplied, in Python 2.7x with supporting data files, and subsequently as ported to neo4j.
Initial Run:
-
create 'agilesde' account on i7d
-
ensure python 2.7x and the networkx and numpy libs are present
-
copy the example data files and code to server
-
execute sample code (first run)
## two programs are run; the first is an analysis of a single disease profile against
## the reference interactome; the second takes two disease profiles and analyzes their separation
agilesde@i7d:~/Documents/saleh_pkg/source$ python localization.py -n interactome.tsv -g PD.txt \
-o output.txt
> default network from "interactome.tsv" will be used
> done loading network:
> network contains 13460 nodes and 141296 links
> done reading genes:
> 20 genes found in PD.txt
> lcc size = 3
> mean shortest distance = 1.05
> random simulation [1000 of 1000]
> gene set from "PD.txt": 20 genes
> lcc size S = 3
> diameter d_s = 1.05
> Random expectation:
> lcc [rand] = 19.671
> => z-score of observed lcc = -27.0169651597
> results have been saved to output.txt
#----------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ ./separation.py -n interactome.tsv --g1 MS.txt --g2 PD.txt \
-o output.txt
> default network from "interactome.tsv" will be used
> done loading network:
> network contains 13460 nodes and 141296 links
> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69
> done reading genes:
> 20 genes found in PD.txt
> gene set A from "MS.txt": 69 genes, network-diameter d_A = 1.85507246377
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.73033707865
> network separation of A & B: s_AB = 1.27780084677
> results have been saved to output.txt
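The separation value in the output above can be checked by hand. The sketch below assumes the definition used in the paper, s_AB = d_AB - (d_A + d_B)/2, and plugs in the three distances printed by the MS/PD run; the function name `separation` is only illustrative and is not taken from the supplied code.

```python
def separation(d_ab, d_a, d_b):
    """Network separation: cross-set distance minus the mean of the two within-set distances."""
    return d_ab - (d_a + d_b) / 2.0


# Values copied from the MS.txt / PD.txt run above.
s_ab = separation(d_ab=2.73033707865, d_a=1.85507246377, d_b=1.05)
print(round(s_ab, 6))  # 1.277801
```

A positive s_AB means the two gene sets sit in separate neighborhoods of the network; values near or below zero indicate overlapping neighborhoods.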
-
pick a new gene set from data/DataS2_disease_genes.tsv
-
make an input file in the expanded form of one gene number per line
ovarian neoplasms (20 genes total, 14 from OMIM, 6 from GWAS)
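The expansion step can be scripted rather than done by hand. The following is a minimal sketch (Python 3) that assumes each row of the disease file is tab-separated as: disease name, total count, OMIM count, GWAS count, OMIM gene list, GWAS gene list, with gene ids separated by ";". That column layout, the separator, the function name `expand_disease_row` and the sample row are all assumptions for illustration; check them against the actual file before use.

```python
def expand_disease_row(row, sep=";"):
    """Return (disease_name, [gene ids]) for one tab-separated row.

    Assumed layout: name, n_all, n_omim, n_gwas, omim_genes, gwas_genes.
    """
    fields = row.rstrip("\n").split("\t")
    name = fields[0]
    genes = []
    for column in fields[4:6]:          # OMIM list, then GWAS list
        for gene in column.split(sep):
            gene = gene.strip()
            if gene and gene not in genes:   # skip blanks and duplicates
                genes.append(gene)
    return name, genes


# Hypothetical row with placeholder gene ids, not real data.
sample = "example disease\t3\t2\t1\t101;102\t103\n"
name, genes = expand_disease_row(sample)
print("\n".join(genes))   # one gene number per line, ready to save as <disease>.txt
```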
python localization.py -n interactome.tsv -g ovarian_neoplasms.txt -o on_0.txt
> default network from "interactome.tsv" will be used
> done loading network:
> network contains 13460 nodes and 141296 links
> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17
> lcc size = 10
> mean shortest distance = 1.70588235294
> random simulation [1000 of 1000]
> gene set from "ovarian_neoplasms.txt": 17 genes
> lcc size S = 10
> diameter d_s = 1.70588235294
> Random expectation:
> lcc [rand] = 16.831
> => z-score of observed lcc = -15.9928345284
> results have been saved to on_0.txt
##-----------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
--g1 ovarian_neoplasms.txt --g2 PD.txt -o on_PD0.txt
> default network from "interactome.tsv" will be used
> done loading network:
> network contains 13460 nodes and 141296 links
> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17
> done reading genes:
> 20 genes found in PD.txt
> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.59459459459
> network separation of A & B: s_AB = 1.21665341812
> results have been saved to on_PD0.txt
#--------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
--g1 ovarian_neoplasms.txt --g2 MS.txt -o on_PD0.txt
> default network from "interactome.tsv" will be used
> done loading network:
> network contains 13460 nodes and 141296 links
> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17
> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69
> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "MS.txt": 69 genes, network-diameter d_B = 1.85507246377
> mean shortest distance between A & B: d_AB = 2.13953488372
> network separation of A & B: s_AB = 0.359057475366
> results have been saved to on_PD0.txt
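The localization runs above report an observed lcc size and a z-score against 1000 random gene sets. The sketch below shows one way such a comparison can be computed, using only the standard library (Python 3) on a toy graph. It is not the supplied localization.py: uniform sampling from all network nodes, the population standard deviation, and every name here (`build_adjacency`, `lcc_size`, `lcc_zscore`) are assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from (a, b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def lcc_size(adj, genes):
    """Size of the largest connected component of the subgraph induced by genes."""
    genes = {g for g in genes if g in adj}   # ignore genes not in the network
    seen, best = set(), 0
    for start in genes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                         # depth-first walk restricted to the gene set
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb in genes and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


def lcc_zscore(adj, genes, n_sim=1000, seed=0):
    """Observed lcc size, mean random lcc size, and z-score of the observation."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    genes = [g for g in set(genes) if g in adj]
    observed = lcc_size(adj, genes)
    sizes = [lcc_size(adj, rng.sample(nodes, len(genes))) for _ in range(n_sim)]
    mean = statistics.mean(sizes)
    std = statistics.pstdev(sizes)
    z = (observed - mean) / std if std > 0 else float("nan")
    return observed, mean, z


# Toy network: three short chains.
toy = build_adjacency([(1, 2), (2, 3), (3, 4), (5, 6), (7, 8), (8, 9), (9, 10)])
observed, mean, z = lcc_zscore(toy, [1, 2, 3, 5], n_sim=200, seed=1)
print(observed, round(mean, 3), round(z, 3))
```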
IPython is an interactive command-line environment for Python programming. It has recently been extended to multi-user, networked execution with a very straightforward deployment model, and re-branded under the Jupyter project (JupyterHub). The interactome code was ported easily into the Jupyter environment for execution.
A notable benefit of the IPython environment is inline graphics, both library-generated output such as Matplotlib plots and HTML-friendly formats such as PNG, JPEG, video and iframes.
(see IPython session udd_graph_ex0.html)
Import the interactome csv definition into a postgres table, for convenient count, uniques, sub-select and new csv generation.
DROP TABLE IF EXISTS interactom_test0;
CREATE TABLE interactom_test0
(
node_a integer,
node_b integer,
desc_orig text
);
Agile=# copy (select distinct(node_b) from interactom_test0) to
'/Users/Shared/chalice_review_assets/neo4j_import_work/dist_b.csv';
Neo4j Community Edition 2.2.1; Oracle Java 1.7.0_79; Mac OS X 10.10.3
10GB ← dbms.pagecache.memory
-
Start / Authenticate
-
Import Data
-
Indexes / Query
-
Backup
conf/neo4j-server.properties:
org.neo4j.server.database.location=data/import0.db
$ bin/neo4j start
CSV Data file formats
node_id:ID,attra
1,1
10,10
100,100
1000,1000
10000,10000
10001,10001
10002,10002
100049587,100049587
10005,10005
:START_ID,:END_ID,:TYPE
100290337,4214,INTERA
122704,54460,INTERA
4790,79155,INTERA
2597,70,INTERA
5923,7157,INTERA
509,6122,INTERA
4067,933,INTERA
398,998,INTERA
1748,5976,INTERA
1537,55967,INTERA
10989,54927,INTERA
55890,7920,INTERA
6629,9140,INTERA
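The two CSV layouts above can also be produced directly from the edge list, as an alternative to the postgres route. This is a minimal sketch (Python 3), assuming the input is tab-separated with the two gene ids in the first two columns; the function name `to_neo4j_csv` is hypothetical, and the third input column is a placeholder that the sketch ignores.

```python
import csv
import io


def to_neo4j_csv(edge_lines, rel_type="INTERA"):
    """Build node and relationship CSV text in the header formats shown above."""
    nodes, rels = set(), []
    for line in edge_lines:
        if not line.strip() or line.startswith("#"):
            continue                                  # skip blanks and comment lines
        a, b = (field.strip() for field in line.split("\t")[:2])
        nodes.update((a, b))
        rels.append((a, b, rel_type))

    node_buf, rel_buf = io.StringIO(), io.StringIO()
    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["node_id:ID", "attra"])
    for n in sorted(nodes, key=int):                  # unique ids, numeric order
        node_writer.writerow([n, n])

    rel_writer = csv.writer(rel_buf, lineterminator="\n")
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    rel_writer.writerows(rels)
    return node_buf.getvalue(), rel_buf.getvalue()


# Id pairs taken from the relationship sample above; the third column is a placeholder.
edges = ["2597\t70\tdesc", "5923\t7157\tdesc", "398\t998\tdesc"]
nodes_csv, rels_csv = to_neo4j_csv(edges)
print(nodes_csv)
print(rels_csv)
```

Writing the node file from a set guarantees each id appears once, which the import tool needs when the id column is declared with :ID.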
$ bin/neo4j-import
calvisitor-10-105-155-98:neo4j-community-2.2.1 Agile$ bin/neo4j-import --into data/import0.db --id-type INTEGER \
> --nodes /Users/Shared/chalice_review_assets/neo4j_import_work/nodes_cmb_uniq.csv \
> --relationships:INTERA /Users/Shared/chalice_review_assets/neo4j_import_work/rels_cmb.csv
Nodes
[>:??---------------------|PROPERTIES---|*NODE:7.63 MB--------------|v:??----------------------] 20k
Done in 421ms
Prepare node index
[*DETECT:11.44 MB------------------------------------------------------------------------------] 10k
Done in 80ms
Calculate dense nodes
[>:??------------------------|*PREPARE----------------------------------------------|CALCULATOR]200k
Done in 225ms
Relationships
[>:??---|*PREPARE----------------------|RELATIONSHIP------------------|v:??--------------------]150k
Done in 243ms
Node --> Relationship
[*>:??------------------------------------------------|LINK------------------------------------] 20k
Done in 34ms
Relationship --> Relationship
[>:??-----------------------|*LINK------------------------------------------------|v:??--------]150k
Done in 88ms
Node counts
[*COUNT:0.00 B---------------------------------------------------------------------------------] 20k
Done in 12ms
Relationship counts
[*COUNT----------------------------------------------------------------------------------------]150k
Done in 36ms
IMPORT DONE in 1s 939ms
START n=node(*) MATCH (n)-[r]->(m) WHERE n.node_id < 22 RETURN n,r,m;