@darkblue-b
Last active August 29, 2015 14:19
asciidoc test0

Search and Matching / Gene Sequences

Notes on implementation language environments, execution performance and machine considerations, based on the software technical supplement to "Uncovering disease-disease relationships through the incomplete interactome" (Science, February 2015).

19Apr15 dbb v0.2

Observations

  • supplied Python code ran quickly on all compute environments (~2 seconds)

  • non-trivial transform of a Python networkx interactome requires 9GB of physical RAM

  • neo4j interactive environment is mature and visually pleasing

  • neo4j queries use a single core only

  • neo4j visualization breaks down at a few hundred nodes, even on the Xeon E5

Initial Supplement Code Execution

Three compute environments (described below) were used for various incarnations of the problem. The code was supplied as Python 2.7.x with supporting data files, and was subsequently ported to neo4j. Initial run:

  • create 'agilesde' account on i7d

  • ensure Python 2.7.x and the networkx and numpy libraries are present

  • copy the example data files and code to server

  • execute sample code (first run)
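The library check in the list above can be sketched as a small script. This is a minimal illustration, not part of the supplied package; it uses `importlib.util.find_spec`, which assumes a Python 3 interpreter, whereas the supplement itself targets Python 2.7.x. The helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # networkx and numpy are the two libraries the supplied scripts depend on.
    missing = missing_modules(["networkx", "numpy"])
    if missing:
        print("missing libraries: " + ", ".join(missing))
    else:
        print("all required libraries found")
```

Running this before the first execution avoids a failure partway through loading the interactome.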

## two programs are run; the first is an analysis of a single disease profile
##  against the reference interactome; the second uses two disease profiles and executes the analysis

agilesde@i7d:~/Documents/saleh_pkg/source$ python localization.py -n interactome.tsv -g PD.txt \
    -o output.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in PD.txt

> lcc size = 3
> mean shortest distance = 1.05

> random simulation [1000 of 1000]
> gene set from "PD.txt": 20 genes
> lcc size   S = 3
> diameter d_s = 1.05

> Random expectation:
> lcc [rand] = 19.671
> => z-score of observed lcc = -27.0169651597

> results have been saved to output.txt


#----------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ ./separation.py -n interactome.tsv --g1 MS.txt --g2 PD.txt \
    -o output.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69

> done reading genes:
> 20 genes found in PD.txt

> gene set A from "MS.txt": 69 genes, network-diameter d_A = 1.85507246377
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.73033707865
> network separation of A & B:          s_AB = 1.27780084677

> results have been saved to output.txt
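The s_AB value reported above is consistent with the separation measure s_AB = d_AB - (d_A + d_B) / 2, where d_A and d_B are the reported network diameters of each gene set and d_AB is the mean shortest distance between them. The sketch below only re-derives the printed number from the other printed numbers; the function name is hypothetical and it is not taken from separation.py.

```python
def network_separation(d_a, d_b, d_ab):
    """Separation of two gene sets from their within-set and between-set distances."""
    return d_ab - (d_a + d_b) / 2.0


# Values copied from the MS.txt / PD.txt run above.
s_ms_pd = network_separation(d_a=1.85507246377, d_b=1.05, d_ab=2.73033707865)
print(round(s_ms_pd, 6))  # 1.277801
```

A positive value indicates that the two gene sets are, on average, farther from each other than their members are from their own set.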

Show an example of another disease input

  • pick a new gene set from data/DataS2_disease_genes.tsv

  • make an input file in the expanded form of one gene number per line
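The expansion step in the list above might look like the following. This is a sketch under assumptions: it supposes the gene IDs for a disease arrive as a single delimited string, and the `;` separator, the function name and the sample IDs are illustrative rather than confirmed against data/DataS2_disease_genes.tsv.

```python
def expand_gene_field(field, sep=";"):
    """Turn a delimited string of gene IDs into text with one ID per line."""
    genes = [g.strip() for g in field.split(sep) if g.strip()]
    if not genes:
        return ""
    return "\n".join(genes) + "\n"


if __name__ == "__main__":
    # Illustrative IDs only; substitute the field copied from the data file.
    text = expand_gene_field("672;675;7157")
    with open("example_gene_set.txt", "w") as handle:
        handle.write(text)
```

The resulting file can be passed to localization.py with `-g`, as in the run that follows.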

ovarian neoplasms (20 genes total, 14 from OMIM, 6 from GWAS)

python localization.py -n interactome.tsv -g ovarian_neoplasms.txt -o on_0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> lcc size = 10
> mean shortest distance = 1.70588235294

> random simulation [1000 of 1000]
> gene set from "ovarian_neoplasms.txt": 17 genes
> lcc size   S = 10
> diameter d_s = 1.70588235294

> Random expectation:
> lcc [rand] = 16.831
> => z-score of observed lcc = -15.9928345284

> results have been saved to on_0.txt

##-----------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
    --g1 ovarian_neoplasms.txt --g2 PD.txt -o on_PD0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> done reading genes:
> 20 genes found in PD.txt

> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "PD.txt": 20 genes, network-diameter d_B = 1.05
> mean shortest distance between A & B: d_AB = 2.59459459459
> network separation of A & B:          s_AB = 1.21665341812

> results have been saved to on_PD0.txt

#--------------------------------------------------
agilesde@i7d:~/Documents/saleh_pkg/source$ python separation.py -n interactome.tsv \
    --g1 ovarian_neoplasms.txt --g2 MS.txt -o on_PD0.txt

> default network from "interactome.tsv" will be used

> done loading network:
> network contains 13460 nodes and 141296 links

> done reading genes:
> 20 genes found in ovarian_neoplasms.txt
> ignoring 3 genes that are not in the network
> remaining number of genes: 17

> done reading genes:
> 108 genes found in MS.txt
> ignoring 39 genes that are not in the network
> remaining number of genes: 69

> gene set A from "ovarian_neoplasms.txt": 17 genes, network-diameter d_A = 1.70588235294
> gene set B from "MS.txt": 69 genes, network-diameter d_B = 1.85507246377
> mean shortest distance between A & B: d_AB = 2.13953488372
> network separation of A & B:          s_AB = 0.359057475366

> results have been saved to on_PD0.txt
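The localization runs above report the size of the largest connected component (lcc) of a gene set and a z-score against 1000 random simulations. The sketch below shows one way such a z-score can be computed, using only the standard library. It assumes uniform random sampling of same-sized node sets and a population standard deviation; localization.py may randomize differently, and the function names are hypothetical.

```python
import random
from collections import deque


def lcc_size(adj, nodes):
    """Size of the largest connected component of the subgraph induced by nodes."""
    nodes = set(nodes) & set(adj)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search restricted to the gene set
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def lcc_zscore(adj, genes, n_sim=1000, seed=0):
    """Return (observed lcc, mean random lcc, z-score) for a gene set."""
    rng = random.Random(seed)
    genes = [g for g in genes if g in adj]  # ignore genes not in the network
    observed = lcc_size(adj, genes)
    pool = sorted(adj)
    sims = [lcc_size(adj, rng.sample(pool, len(genes))) for _ in range(n_sim)]
    mean = sum(sims) / float(n_sim)
    std = (sum((s - mean) ** 2 for s in sims) / float(n_sim)) ** 0.5
    z = (observed - mean) / std if std > 0 else float("nan")
    return observed, mean, z


if __name__ == "__main__":
    # Tiny illustrative network as an adjacency dict: path 1-2-3, edge 5-6, isolated 4.
    adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set(), 5: {6}, 6: {5}}
    print(lcc_zscore(adj, [1, 2, 3], n_sim=200))
```

The adjacency dict stands in for the networkx graph that the supplied scripts build from interactome.tsv.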

IPython / Jupyter Hub

IPython is an interactive command-line environment for Python programming. Its notebook front end has recently been re-branded as Jupyter, and JupyterHub extends it to multi-user, networked execution. The interactome code was ported easily into the Jupyter environment for execution.

A notable benefit of the IPython environment is inline graphics, both library-generated (such as Matplotlib) and HTML-friendly formats such as PNG, JPEG, video and iframes.

(see IPython session udd_graph_ex0.html)

PostgreSQL Import

Import the interactome CSV definition into a PostgreSQL table for convenient counts, unique values, sub-selects and new CSV generation.

09 04 2013 postgresql 2013 01 28
DROP TABLE IF EXISTS interactom_test0;
CREATE TABLE interactom_test0
(
  node_a integer,
  node_b integer,
  desc_orig text
);

Agile=# copy (select distinct(node_b) from interactom_test0) to
  '/Users/Shared/chalice_review_assets/neo4j_import_work/dist_b.csv';
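For comparison, the same distinct-value extraction can be sketched without a database. The example assumes the interactome file is tab-separated with the two node IDs in the first two columns and optional `#` comment lines; that layout mirrors the node_a, node_b, desc_orig table above but is otherwise an assumption, as is the function name.

```python
import csv
import io


def distinct_nodes(tsv_text):
    """Return the sets of distinct IDs seen in the first and second columns."""
    nodes_a, nodes_b = set(), set()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        nodes_a.add(int(row[0]))
        nodes_b.add(int(row[1]))
    return nodes_a, nodes_b


if __name__ == "__main__":
    sample = "# node_a\tnode_b\tdesc_orig\n1\t10\tx\n10\t100\ty\n1\t100\tz\n"
    a, b = distinct_nodes(sample)
    print(len(a), len(b), len(a | b))  # 2 2 3
```

The union of both sets corresponds to the combined unique node list used for the neo4j import.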

Neo4j // Import and Execution

Neo4j Community Edition 2.2.1; Oracle Java 1.7.0_79; Mac OS X 10.10.3

10GB ← dbms.pagecache.memory

  1. Start / Authenticate

  2. Import Data

  3. Indexes / Query

  4. Backup

conf/neo4j-server.properties / org.neo4j.server.database.location=data/import0.db

$ bin/neo4j start

CSV Data file formats

nodes_cmb_uniq.csv
node_id:ID,attra
1,1
10,10
100,100
1000,1000
10000,10000
10001,10001
10002,10002
100049587,100049587
10005,10005

rels_cmb.csv
:START_ID,:END_ID,:TYPE
100290337,4214,INTERA
122704,54460,INTERA
4790,79155,INTERA
2597,70,INTERA
5923,7157,INTERA
509,6122,INTERA
4067,933,INTERA
398,998,INTERA
1748,5976,INTERA
1537,55967,INTERA
10989,54927,INTERA
55890,7920,INTERA
6629,9140,INTERA
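The two files shown above can be generated from an edge list with a few lines of Python. This is a sketch that reproduces the header rows and the INTERA relationship type exactly as they appear in the samples; the function name is hypothetical, and sorting node IDs as strings is an assumption made to match the ordering of the node sample.

```python
import csv
import io


def build_import_csvs(edges, rel_type="INTERA"):
    """Return (nodes_csv, rels_csv) text in the layout shown in the samples."""
    node_ids = sorted({str(n) for edge in edges for n in edge})

    nodes_out = io.StringIO()
    writer = csv.writer(nodes_out, lineterminator="\n")
    writer.writerow(["node_id:ID", "attra"])
    for node_id in node_ids:
        writer.writerow([node_id, node_id])  # attra repeats the ID, as in the sample

    rels_out = io.StringIO()
    writer = csv.writer(rels_out, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for start, end in edges:
        writer.writerow([start, end, rel_type])

    return nodes_out.getvalue(), rels_out.getvalue()


if __name__ == "__main__":
    nodes_csv, rels_csv = build_import_csvs([(2597, 70), (5923, 7157)])
    print(nodes_csv)
    print(rels_csv)
```

Writing the two strings to disk yields inputs for the `--nodes` and `--relationships` options of the import command.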
$ bin/neo4j-import

calvisitor-10-105-155-98:neo4j-community-2.2.1 Agile$ bin/neo4j-import --into data/import0.db  --id-type INTEGER \
> --nodes /Users/Shared/chalice_review_assets/neo4j_import_work/nodes_cmb_uniq.csv \
> --relationships:INTERA /Users/Shared/chalice_review_assets/neo4j_import_work/rels_cmb.csv
Nodes
[>:??---------------------|PROPERTIES---|*NODE:7.63 MB--------------|v:??----------------------] 20k
Done in 421ms
Prepare node index
[*DETECT:11.44 MB------------------------------------------------------------------------------] 10k
Done in 80ms
Calculate dense nodes
[>:??------------------------|*PREPARE----------------------------------------------|CALCULATOR]200k
Done in 225ms
Relationships
[>:??---|*PREPARE----------------------|RELATIONSHIP------------------|v:??--------------------]150k
Done in 243ms
Node --> Relationship
[*>:??------------------------------------------------|LINK------------------------------------] 20k
Done in 34ms
Relationship --> Relationship
[>:??-----------------------|*LINK------------------------------------------------|v:??--------]150k
Done in 88ms
Node counts
[*COUNT:0.00 B---------------------------------------------------------------------------------] 20k
Done in 12ms
Relationship counts
[*COUNT----------------------------------------------------------------------------------------]150k
Done in 36ms

IMPORT DONE in 1s 939ms

Snapshot of neo4j live environment

START n=node(*) MATCH (n)-[r]->(m) WHERE n.node_id < 22 RETURN n,r,m;
(screenshot: neo4j interactome import1)
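Because the browser visualization degrades at a few hundred nodes, it can help to estimate how large a selection will be before running it. The sketch below applies the same filter as the query above (start node ID below 22) to an in-memory edge list; it is a plain Python approximation and does not talk to neo4j. The function name and sample edges are illustrative.

```python
def preview_subgraph(edges, max_start_id=22):
    """Select edges whose start node ID is below max_start_id, plus the nodes they touch."""
    selected = [(a, b) for a, b in edges if a < max_start_id]
    nodes = {n for edge in selected for n in edge}
    return selected, nodes


if __name__ == "__main__":
    edges = [(2597, 70), (10, 100), (21, 5000), (22, 1)]
    selected, nodes = preview_subgraph(edges)
    print(len(selected), "relationships,", len(nodes), "nodes")
```

If the node count comes back in the hundreds, lowering the threshold keeps the rendered graph responsive.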

Compute Hardware

AgileSDE MacPro Bullet

6-core Xeon E5
16 GB RAM @ 1866 MHz
SSD 256GB

i7d

4-core (8-thread) i7-960
16 GB RAM @ 2000 MHz
2TB Western Digital black label

MacBook Pro laptop 2007

2-core Intel Core 2 Duo
4 GB RAM @ 667 MHz
500GB Western Digital black label