Search and Matching / Gene Sequences

Notes on implementation languages and environments, execution performance, and machine considerations, based on the software technical supplement to "Uncovering disease-disease relationships through the incomplete interactome" (Science, February 2015).

19Apr15 dbb _v0.2

Observations

the supplied Python code ran quickly on all compute environments (~2 seconds)
a non-trivial transform of the interactome held as a Python networkx graph requires 9 GB of physical RAM
neo4j interactive environment is mature and visually pleasing
neo4j queries use a single core only
neo4j visualization breaks down at a few hundred nodes, even on the Xeon E5

Initial Supplement Code Execution

Three compute environments (described below) were used for successive versions of the problem: first as supplied, in Python 2.7.x with supporting data files, and subsequently ported to neo4j. Initial run:

create 'agilesde' account on i7d
ensure Python 2.7.x and the networkx and numpy libraries are present
copy the example data files and code to server
execute sample code (first run)

## two programs are run; the first analyzes a single disease profile against
##  the reference interactome; the second analyzes two disease profiles together

agilesde@i7d:~/Documents/saleh_pkg/source$ python localization.py -n interactome.tsv -g PD.txt \
    -o output.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in PD.txt

> lcc size = 3
> mean shortest distance = 1.05

> random simulation [1000 of 1000]
> gene set from "PD.txt": 20 genes
> lcc size   S = 3
> diameter d_s = 1.05

> Random expectation:
> lcc [rand] = 19.671
> => z-score of observed lcc = -27.0169651597

> results have been saved to output.txt
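
The z-score in the output above compares the observed LCC size with the LCC sizes obtained from the 1000 random gene sets. A minimal sketch of that calculation follows. It assumes the population standard deviation is used, and the function name lcc_z_score and the toy sample are illustrative, not taken from localization.py.

```python
from statistics import mean, pstdev


def lcc_z_score(observed_lcc, random_lcc_sizes):
    """z-score of an observed LCC size against randomized LCC sizes."""
    mu = mean(random_lcc_sizes)
    # Assumption: population standard deviation (ddof = 0).
    sigma = pstdev(random_lcc_sizes)
    if sigma == 0:
        raise ValueError("random LCC sizes have zero spread")
    return (observed_lcc - mu) / sigma


# Toy sample for illustration only, not data from the run above.
toy_z = lcc_z_score(4, [1, 2, 3])

# Values printed for PD.txt: observed lcc = 3, random mean = 19.671,
# z = -27.0169651597. Rearranging z = (obs - mean) / std gives the
# spread of the random LCC sizes that the log implies.
implied_std = (3 - 19.671) / -27.0169651597
```

Rearranging the printed values this way is a quick sanity check that the reported mean and z-score belong together.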


#----------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ ./separation.py -n interactome.tsv --g1 MS.txt --g2 PD.txt \
    -o output.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69

> done reading genes:
> 20 genes found in PD.txt

> gene set A from "MS.txt": 69 genes, network-diameter d_A = 1.85507246377
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.73033707865
> network separation of A & B:          s_AB = 1.27780084677

> results have been saved to output.txt
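
The separation value printed above is consistent with s_AB = d_AB - (d_A + d_B) / 2. The short check below recomputes it from the three distances in the log; the function name network_separation is illustrative and does not come from separation.py.

```python
def network_separation(d_a, d_b, d_ab):
    """s_AB = d_AB - (d_A + d_B) / 2."""
    return d_ab - (d_a + d_b) / 2.0


# Distances copied from the MS.txt / PD.txt run above.
s_ab = network_separation(d_a=1.85507246377, d_b=1.05, d_ab=2.73033707865)
print(round(s_ab, 6))  # 1.277801
```

A positive s_AB, as here, indicates that the two gene sets sit in separate network neighborhoods; a negative value indicates overlap.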

Show an example of another disease input

pick a new gene set from data/DataS2_disease_genes.tsv
make an input file in the expanded form of one gene number per line

ovarian neoplasms (20 genes total, 14 from OMIM, 6 from GWAS)
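
Expanding one disease row into a one-gene-per-line file could be scripted roughly as follows. The layout of DataS2_disease_genes.tsv is an assumption here: the sketch supposes a tab-separated file with a header row, a "disease" column, and semicolon-separated gene IDs in columns named "OMIM_genes" and "GWAS_genes". Those column names, the helper names and the sample IDs are all hypothetical and should be adjusted to the real file.

```python
import csv
import io


def disease_gene_ids(tsv_text, disease,
                     gene_columns=("OMIM_genes", "GWAS_genes"), sep=";"):
    """Return the gene IDs listed for one disease, without duplicates."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["disease"] != disease:
            continue
        ids = []
        for col in gene_columns:
            cell = row.get(col) or ""
            ids.extend(g.strip() for g in cell.split(sep) if g.strip())
        # dict.fromkeys keeps first-seen order while dropping repeats.
        return list(dict.fromkeys(ids))
    raise KeyError(disease)


def gene_file_text(ids):
    """One gene ID per line, the input form used with the -g option above."""
    return "\n".join(ids) + "\n"


# Placeholder row, not real data from the supplement.
sample = "disease\tOMIM_genes\tGWAS_genes\nexample disease\t11;22\t22;33\n"
ids = disease_gene_ids(sample, "example disease")
text = gene_file_text(ids)
```

Writing `text` to a file such as ovarian_neoplasms.txt gives the input used in the run that follows.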

python localization.py -n interactome.tsv -g ovarian_neoplasms.txt -o on_0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> lcc size = 10
> mean shortest distance = 1.70588235294

> random simulation [1000 of 1000]
> gene set from "ovarian_neoplasms.txt": 17 genes
> lcc size   S = 10
> diameter d_s = 1.70588235294

> Random expectation:
> lcc [rand] = 16.831
> => z-score of observed lcc = -15.9928345284

> results have been saved to on_0.txt
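
The "lcc size" reported above is the size of the largest connected component formed by the disease genes. As a rough illustration of that quantity, the sketch below assumes the component is taken over the subgraph induced by the gene set and that genes absent from the network are dropped first, as the "ignoring 3 genes" message suggests. The adjacency dictionary and gene IDs are toy values, and lcc_size is a hypothetical helper, not a function from the supplied code.

```python
from collections import deque


def lcc_size(adjacency, genes):
    """Size of the largest connected component among `genes`.

    `adjacency` maps each node to the set of its neighbours. Only links
    whose two endpoints are both in `genes` are followed.
    """
    genes = set(genes) & set(adjacency)  # drop genes not in the network
    seen = set()
    best = 0
    for start in genes:
        if start in seen:
            continue
        # Breadth-first search restricted to the gene set.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour in genes and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


# Toy network: a path 1-2-3-4 plus a separate link 5-6.
toy_net = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {6}, 6: {5}}
print(lcc_size(toy_net, [1, 2, 4, 5, 6, 99]))  # 2
```

Gene 4 is isolated in this example because its only neighbour, 3, is not in the gene set, and gene 99 is ignored because it is not in the network.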

##-----------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
    --g1 ovarian_neoplasms.txt --g2 PD.txt -o on_PD0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> done reading genes:
> 20 genes found in PD.txt

> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.59459459459
> network separation of A & B:          s_AB = 1.21665341812

> results have been saved to on_PD0.txt

#--------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
    --g1 ovarian_neoplasms.txt --g2 MS.txt -o on_PD0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69

> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "MS.txt": 69 genes, network-diameter d_B = 1.85507246377
> mean shortest distance between A & B: d_AB = 2.13953488372
> network separation of A & B:          s_AB = 0.359057475366

> results have been saved to on_PD0.txt

IPython / Jupyter Hub

IPython is an interactive command-line environment for Python programming. It has recently been extended with a browser-based notebook and re-branded as Jupyter; JupyterHub adds multi-user, networked execution with a straightforward deployment model. The interactome code was ported easily into the Jupyter environment for execution.

A notable benefit of the IPython environment is inline graphics: both library-generated output, such as Matplotlib plots, and HTML-friendly formats, such as PNG, JPEG, video and iframes.

(see IPython session udd_graph_ex0.html)

PostgreSQL Import

Import the interactome CSV definition into a PostgreSQL table for convenient counts, distinct values, sub-selects and generation of new CSV files.

DROP TABLE IF EXISTS interactom_test0;
CREATE TABLE interactom_test0
(
  node_a integer,
  node_b integer,
  desc_orig text
);

Agile=# copy (select distinct(node_b) from interactom_test0) to
  '/Users/Shared/chalice_review_assets/neo4j_import_work/dist_b.csv';
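
The same distinct-value and count queries can be tried without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in (SQLite, not PostgreSQL), keeps the table and column names from above, and loads three made-up rows in place of the real interactome. The helper names are illustrative.

```python
import sqlite3


def load_edges(conn, rows):
    """Create the edge table and insert (node_a, node_b, desc_orig) rows."""
    conn.execute("DROP TABLE IF EXISTS interactom_test0")
    conn.execute(
        "CREATE TABLE interactom_test0 "
        "(node_a INTEGER, node_b INTEGER, desc_orig TEXT)"
    )
    conn.executemany("INSERT INTO interactom_test0 VALUES (?, ?, ?)", rows)


def distinct_node_b(conn):
    """Equivalent of: select distinct(node_b) from interactom_test0."""
    cur = conn.execute(
        "SELECT DISTINCT node_b FROM interactom_test0 ORDER BY node_b"
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
# Made-up rows for illustration only.
load_edges(conn, [(1, 10, "src"), (2, 10, "src"), (1, 100, "src")])
unique_b = distinct_node_b(conn)
edge_count = conn.execute("SELECT COUNT(*) FROM interactom_test0").fetchone()[0]
```

Writing `unique_b` out one value per line yields the same kind of file as the dist_b.csv export above.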

Neo4j // Import and Execution

Neo4j Community Edition 2.2.1; Oracle Java 1.7.0_79; Mac OS X 10.10.3

10GB ← dbms.pagecache.memory

Start / Authenticate
Import Data
Indexes / Query
Backup

conf/neo4j-server.properties / org.neo4j.server.database.location=data/import0.db

$ bin/neo4j start

CSV Data file formats

nodes_cmb_uniq.csv

node_id:ID,attra
1,1
10,10
100,100
1000,1000
10000,10000
10001,10001
10002,10002
100049587,100049587
10005,10005

rels_cmb.csv

:START_ID,:END_ID,:TYPE
100290337,4214,INTERA
122704,54460,INTERA
4790,79155,INTERA
2597,70,INTERA
5923,7157,INTERA
509,6122,INTERA
4067,933,INTERA
398,998,INTERA
1748,5976,INTERA
1537,55967,INTERA
10989,54927,INTERA
55890,7920,INTERA
6629,9140,INTERA
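
Both files can be derived from a list of node pairs. The sketch below reproduces the two header lines shown above and, as an assumption read off the sample rows, repeats the node id in the attra column and orders the node ids as strings. The function name build_import_csv and the four-node example are illustrative.

```python
def build_import_csv(edges, rel_type="INTERA"):
    """Return (nodes_csv, rels_csv) text in the layout shown above."""
    # Unique node ids from both ends of every edge, ordered as strings.
    node_ids = sorted({str(node) for edge in edges for node in edge})
    node_lines = ["node_id:ID,attra"]
    node_lines += ["{0},{0}".format(node) for node in node_ids]
    rel_lines = [":START_ID,:END_ID,:TYPE"]
    rel_lines += ["{0},{1},{2}".format(a, b, rel_type) for a, b in edges]
    return "\n".join(node_lines) + "\n", "\n".join(rel_lines) + "\n"


# Two relationships taken from the rels sample above.
nodes_csv, rels_csv = build_import_csv([(2597, 70), (398, 998)])
```

Saving the two strings as nodes_cmb_uniq.csv and rels_cmb.csv gives files in the form passed to bin/neo4j-import below.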

$ bin/neo4j-import

calvisitor-10-105-155-98:neo4j-community-2.2.1 Agile$ bin/neo4j-import --into data/import0.db  --id-type INTEGER \
> --nodes /Users/Shared/chalice_review_assets/neo4j_import_work/nodes_cmb_uniq.csv \
> --relationships:INTERA /Users/Shared/chalice_review_assets/neo4j_import_work/rels_cmb.csv
Nodes
[>:??---------------------|PROPERTIES---|*NODE:7.63 MB--------------|v:??----------------------] 20k
Done in 421ms
Prepare node index
[*DETECT:11.44 MB------------------------------------------------------------------------------] 10k
Done in 80ms
Calculate dense nodes
[>:??------------------------|*PREPARE----------------------------------------------|CALCULATOR]200k
Done in 225ms
Relationships
[>:??---|*PREPARE----------------------|RELATIONSHIP------------------|v:??--------------------]150k
Done in 243ms
Node --> Relationship
[*>:??------------------------------------------------|LINK------------------------------------] 20k
Done in 34ms
Relationship --> Relationship
[>:??-----------------------|*LINK------------------------------------------------|v:??--------]150k
Done in 88ms
Node counts
[*COUNT:0.00 B---------------------------------------------------------------------------------] 20k
Done in 12ms
Relationship counts
[*COUNT----------------------------------------------------------------------------------------]150k
Done in 36ms

IMPORT DONE in 1s 939ms

Snapshot of neo4j live environment

START n=node(*) MATCH (n)-[r]->(m) WHERE n.node_id < 22 RETURN n,r,m;

Compute Hardware

AgileSDE MacPro Bullet

6-core Xeon E5
16 GB RAM @ 1866 MHz
SSD 256GB

i7d

4-core / 8-thread i7-960
16 GB RAM @ 2000 MHz
2TB Western Digital black label

MacBook Pro laptop 2007

2-core Intel Core-duo
4 GB RAM @ 667 MHz
500GB Western Digital black label
