The "proper" way to do a "cable2graph" free-form text search would be a 3-stage process:
- Find the relevant MRNs
- Generate a neighborhood graph for those MRNs (nbh)
- Create an html+svg from that graph (g2svg)
If we want to make a cable2graph web service that responds fast enough and doesn't use too many resources, the main concern would be stage 3.
This is a proof of concept for a "search lite" process that may not yield the best possible results, but in my experience what you get is not too shabby, and it can be fast and light enough to run as a web service.
There's a one-time preprocessing stage (preprocess.sh) where we split the full graph into cluster graphml files (splitgraph) and generate html files for all of them (g2svg).
We then create a json file mapping each MRN to the graphs that contain it (make-mrn2graphs.py).
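The mapping step amounts to building an inverted index. Below is a minimal sketch of that idea, not the actual make-mrn2graphs.py: it assumes the MRNs of each cluster graph have already been extracted into a dict, and the names build_mrn2graphs, graph_mrns, and the "cluster-…" / "MRN-…" values are hypothetical placeholders.

```python
import json
from collections import defaultdict


def build_mrn2graphs(graph_mrns):
    """Invert {graph name: MRNs in it} into {MRN: sorted list of graph names}."""
    mrn2graphs = defaultdict(set)
    for graph_name, mrns in graph_mrns.items():
        for mrn in mrns:
            mrn2graphs[mrn].add(graph_name)
    # Sets are not JSON-serializable, so convert them to sorted lists.
    return {mrn: sorted(graphs) for mrn, graphs in mrn2graphs.items()}


# Hypothetical cluster contents, for illustration only.
graph_mrns = {
    "cluster-0001": ["MRN-A", "MRN-B"],
    "cluster-0002": ["MRN-B", "MRN-C"],
}
mrn2graphs = build_mrn2graphs(graph_mrns)
print(json.dumps(mrn2graphs, indent=2, sort_keys=True))
```

Storing lists keyed by MRN means a search only needs one dictionary lookup per relevant MRN, with no graph files opened at query time.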
The search itself (search-subgraphs.sh) is a 2-stage process:
- Find the relevant MRNs (rMRNs) [*]
- Find graphs that contain any rMRNs, and sort them by the number of rMRNs they contain. This is quite fast, since each MRN is mapped to the graphs containing it, and all we need to do is prepare a histogram.
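The histogram in the second stage can be sketched roughly as follows. This is an illustration of the counting idea rather than the code in search-subgraphs.sh; rank_graphs and the sample data are hypothetical, and the tie-break by graph name is an assumption added only to make the order deterministic.

```python
from collections import Counter


def rank_graphs(rmrns, mrn2graphs):
    """Return (graph, hit count) pairs, graphs with the most rMRNs first."""
    histogram = Counter()
    for mrn in set(rmrns):  # count each relevant MRN once
        histogram.update(mrn2graphs.get(mrn, []))
    # Sort by descending count, then by name for a stable order.
    return sorted(histogram.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical MRN-to-graphs mapping, for illustration only.
mrn2graphs = {
    "MRN-A": ["cluster-0001"],
    "MRN-B": ["cluster-0001", "cluster-0002"],
    "MRN-C": ["cluster-0002"],
}
ranking = rank_graphs(["MRN-A", "MRN-B", "MRN-X"], mrn2graphs)
print(ranking)
```

MRNs that appear in no graph (like "MRN-X" here) simply contribute nothing to the histogram.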
The result is an html file with links to the relevant html files, sorted by relevance.
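As a rough illustration of what such a result page could look like, here is a sketch that turns a ranked list into an ordered list of links. The function render_results, the page layout, and the assumption that each graph's page is named after the graph with an .html suffix are all hypothetical and not taken from the scripts.

```python
from html import escape


def render_results(ranking, query):
    """Build a minimal html page linking to each graph's page, best match first."""
    items = "\n".join(
        '<li><a href="{0}.html">{0}</a> ({1} rMRNs)</li>'.format(
            escape(name, quote=True), count
        )
        for name, count in ranking
    )
    return (
        "<html><body><h1>Results for {}</h1>\n<ol>\n{}\n</ol></body></html>"
    ).format(escape(query), items)


# Hypothetical ranking, for illustration only.
page = render_results([("cluster-0001", 2), ("cluster-0002", 1)], "wataniya")
print(page)
```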
example-result.html is a result of a search for "wataniya", and here's the best-matching subgraph (22 occurrences)
All files here (except for example-result.html, of course) should reside in the cable2graph folder.
Files generated by these scripts (including full.graphml) will be in $OUTDIR (i.e. ./out) or subdirectories of it.
Enjoy.
[*] This should probably be done with something like Solr, but since I was too lazy to do that, I used the cablesearch API (cablesearch2mrns.py).