I had some days left on a physical machine we used for an EU FP7 research project, so I took the chance to compare 3 triplestores (update: added some more based on comments here) that I or my colleagues worked with in the past months. I do not want to imply anything with this test; it's just me playing around and having fun with RDF. If you have any comments, add them here.
The test platform comprises a dedicated server, not a virtual machine, with the following specification:
- 2 x Intel Xeon E5-2620 v2, 2 x (6 x 2.10 GHz) (appears as 24 cores in htop)
- 128 GB buffered ECC RAM
- 1000 GB SSD (Samsung 840 EVO)
- Ubuntu 14.04
The dataset contains 5 million triples (including some which are not valid RDF, as "NA" is declared as xsd:int). It describes transports between entities, each with a date. To optimize query execution time for this particular use case, we want to materialize (i.e. pre-compute and store) some inferred triples so we don't have to go through all the data every time.
Source: http://ktk.netlabs.org/misc/bfs/blv.nt (622 MB)
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pobo: <http://purl.obolibrary.org/obo/> .
<http://foodsafety.data.admin.ch/move/0> a schema:TransferAction ;
    schema:fromLocation <http://foodsafety.data.admin.ch/business/50454> ;
    schema:toLocation <http://foodsafety.data.admin.ch/business/50415> ;
    dc:date "2012-01-01"^^xsd:date ;
    pobo:UO_0000189 "1"^^xsd:int .
There are around 900'000 schema:TransferAction instances in there. We torture the server with the following CONSTRUCT (well, INSERT) query:
PREFIX blv: <http://blv.ch/>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT {
  ?othermove blv:notBefore ?move .
}
WHERE {
  ?move a schema:TransferAction ;
    dc:date ?date ;
    schema:toLocation ?toFarm .
  ?othermove a schema:TransferAction ;
    dc:date ?otherdate ;
    schema:fromLocation ?toFarm .
  FILTER (?date <= ?otherdate)
}
After successful execution, I check how many triples were generated:
SELECT (COUNT(*) AS ?c) WHERE {?s <http://blv.ch/notBefore> ?o}
This should return around 30 million triples.
Note that I did not do any optimization of the configurations. My idea was to take what vendors ship by default and see how long it takes, because that's what users usually do ;)
- Homepage: http://virtuoso.openlinksw.com/
- Version: Virtuoso version 07.20.3215 on Linux (x86_64-unknown-linux-gnu), Single Server Edition
- Host: docker, image tenforce/virtuoso
- Query execution time: 23 minutes
Loading RDF was fast; I did it with iSQL according to the documentation of the Docker image (roughly as sketched below). Virtuoso does not seem to use more than one core: during the whole execution time I had 100% load on one of the 24 cores while the rest did nothing.
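For reference, the standard Virtuoso bulk loader run from an iSQL session looks roughly like this. The directory, file name and graph IRI are assumptions (and the directory has to be listed in DirsAllowed in virtuoso.ini), so adjust them to your setup:
-- register the N-Triples file for the bulk loader
ld_dir('/data', 'blv.nt', 'http://foodsafety.data.admin.ch/');
-- load everything that was registered and persist it
rdf_loader_run();
checkpoint;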
- Homepage: http://stardog.com/
- Version: 4.0.5, Enterprise license (1 month trial key)
- Host: docker, image java:latest, as there is no public docker image available.
- Run: default configuration, started with stardog-admin server start
- Query execution time: 4.00 minutes
Loading was fast; I did it with stardog data add on the command line (see the sketch below). I had the impression there is some query optimization going on: in the beginning there was not much activity on the different cores, but after a while the box became busier and I saw quite some load on all cores. By far the fastest query execution time.
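A minimal loading sequence with the Stardog 4 CLI could look like this; the database name 'blv' is an assumption for the example:
# create an empty database (name is made up for this example)
stardog-admin db create -n blv
# add the N-Triples dump to it
stardog data add blv blv.nt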
- Homepage: https://www.blazegraph.com/
- Version: 2.1.0
- Host: docker, image java:latest, as there is no public docker image available.
- Run: java -server -Xmx8g -jar blazegraph.jar
- Query execution time: 33 minutes
I first used a docker image but didn't notice that it was the old 1.x version. I ran into a bug while executing the query on a 24-core machine and was asked to retry with 2.x, so make sure you use 2.x as well; all docker images seem to be 1.x. Loading was fast: I loaded the file from its URI in the SPARQL UPDATE web interface (see the statement below). Blazegraph was the most active on all cores; I basically had quite some load on them the whole time. I also tried with 64 GB of memory allocated to the JVM, but that was apparently not a bottleneck.
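Loading from a URI is just a standard SPARQL 1.1 Update LOAD statement pasted into the UPDATE tab of the web interface, something like:
# pull the N-Triples file from its URI into the default graph
LOAD <http://ktk.netlabs.org/misc/bfs/blv.nt>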
- Homepage: https://jena.apache.org/documentation/serving_data/
- Version: 2.0.1-SNAPSHOT
- Host: docker, image stain/jena-fuseki
- Query execution time: TODO (did not finish; see update below)
I started the docker image and loaded the data with tdbloader into /fuseki/databases/blv (roughly as sketched below). After that I created a new database in the web interface, which apparently didn't override the TDB store. The loading time is fast. While executing the query there is high load on all cores.
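Inside the container, such a TDB bulk load would look something like this; the path of the staged N-Triples file is an assumption:
# bulk load the staged file into the TDB database directory Fuseki points at
tdbloader --loc=/fuseki/databases/blv /staging/blv.nt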
UPDATE 27.4.2016: I increased -Xmx to 8 GB and after around 6 hours I ran out of heap space. Not sure if we get anywhere without optimizing it (and I don't really know how).
- Homepage: http://ontotext.com/products/graphdb/
- Version: GraphDB Free 7.0
- Host: docker, image java:latest, as there is no public docker image available.
- Run: ~/graphdb-free-7.0.0/bin# ./graphdb
- Query execution time: 16 minutes
I created a new default store configuration and didn't change any of the default settings regarding cache size etc. Loading was done via URL and was fast (an alternative via the REST API is sketched below). I see load on only one core.
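If you prefer the command line over the workbench, GraphDB also speaks the RDF4J-style REST API, so something like the following sketch should work as well; the port and the repository name are assumptions based on the defaults:
# POST the N-Triples file into a repository called 'blv' (name and port are assumptions)
curl -X POST -H 'Content-Type: application/n-triples' \
  --data-binary @blv.nt \
  http://localhost:7200/repositories/blv/statements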
- Homepage: http://www.ontos.com/products/ontoquad/
- Version: 0.6.0
- Host: docker, built from the Dockerfile found in ontoquad-docker.txz
- Query execution time: 31 minutes (default config, polymorphic2)
- Query execution time: 14 minutes (polymorphic2, no transaction)
After consulting the documentation in Confluence I managed to upload the file as Triples, which I had copied into the docker image. Loading is fast. The default query execution timeout was too low; I could change it in the web interface but for some reason it never seemed to get stored, so I changed it in the config file itself before building the docker image. Same problem with transactions: I disabled them in the config for the second round.
We took a look on our side and saw similar results for Blazegraph. We did have some optimizations of the query that ran much faster, but opened ticket BLZG-1902 nonetheless.