Review: ISWC 2017 Resources Submission 167

URI: https://gist.github.com/stain/ce1f8884986d1109429f4cd0ed22c2d4
Title: Norwegian State of Estate Report as Linked Open Data
Authors: Ling Shi, Dina Sukhobok, Nikolay Nikolov and Dumitru Roman
Call: ISWC 2017 Resources Track
Submitted preprint: TODO
Resource: https://datahub.io/dataset/norwegiansoe https://doi.org/10.5281/zenodo.818088
Review by: Stian Soiland-Reyes (#4 of 4)
Outcome: Reject for ISWC2017 Resource Track
- Accepted as ISWC 2017 Demo
  
  Shi, L., Pettersen, B. E., Sukhobok, D., Nikolov N., and Roman, D. (2017): Linked Data for the Norwegian State of Estate Reporting Service. International Semantic Web Conference. Demo paper. [pdf] [preprint]
- Revised & extended paper was later accepted at ODBASE 2017 - The 16th International Conference on Ontologies, DataBases, and Applications of Semantics, as https://doi.org/10.1007/978-3-319-69459-7_30 -- related blog
  
  L. Shi, D. Sukhobok, N. Nikolov and D. Roman. Norwegian State of Estate Report as Linked Open Data. To appear in the proceedings of ODBASE 2017 – The 16th International Conference on Ontologies, DataBases, and Applications of Semantics, Springer, 24-25 October 2017, Rhodes, Greece. https://doi.org/10.1007/978-3-319-69459-7_30 [preprint]

This review is licensed under a Creative Commons Attribution 4.0 International License.

Evaluation

Reviewer's confidence: 3: high

Appropriateness: 1: good

Clarity and quality of writing: 2: very good

Related work: 0: sufficient

Originality: 1: good

Impact of ideas and results: 1: good

Implementation and soundness: 2: very good

Evaluation: 2: very good

Assessment of the resource

Assessment of the resource: 2: very good

I retrieved the data dump with wget https://rdf.datagraft.net/4035596353/db/repositories/norwegian-state-of-estate-report-6/statements - which, through content negotiation, returned N-Triples. (I see that a browser downloads RDF/XML instead, which I didn't test.)
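A minimal sketch of the same retrieval with an explicit media type, assuming the endpoint honours `Accept: application/n-triples` (wget itself sends a generic `Accept: */*`). The helper names are illustrative and not part of any DataGraft API.

```python
import urllib.request

DUMP_URL = ("https://rdf.datagraft.net/4035596353/db/repositories/"
            "norwegian-state-of-estate-report-6/statements")


def build_dump_request(url=DUMP_URL, media_type="application/n-triples"):
    """Build a GET request that asks for one specific RDF serialisation."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_dump(path, url=DUMP_URL, media_type="application/n-triples"):
    """Download the dump to `path`; return the Content-Type the server chose."""
    request = build_dump_request(url, media_type)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # Stream in chunks so a large dump is never held in memory at once.
        while chunk := response.read(1 << 16):
            out.write(chunk)
        return response.headers.get("Content-Type", "")
```

Comparing the returned Content-Type for `application/n-triples` and `application/rdf+xml` would show whether the server really negotiates or simply falls back to a default.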

I would also have expected an archived data dump with a VOID file and license information; the current setup seems a bit more fragile - can the dataset change at rdf.datagraft.net, or is "-6" a version number indicating that it is fixed? The versioning strategy must be explicit if people are going to use this dataset.

I loaded the triples into a new dataset in a fresh Apache Jena Fuseki 2.6.0 instance to inspect it and test the SPARQL queries - there were no issues in loading.
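A quick sanity check after such a load is to count the triples through the SPARQL 1.1 Protocol. The sketch below assumes a local Fuseki on its default port with a dataset hypothetically named `soe`; the dataset name and helper names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical local endpoint: Fuseki's default port, dataset named "soe".
LOCAL_ENDPOINT = "http://localhost:3030/soe/sparql"

COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_sparql_request(endpoint, query):
    """Encode a query as a SPARQL 1.1 Protocol GET request for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


def count_triples(endpoint=LOCAL_ENDPOINT):
    """Run the count query and return the number of triples as an int."""
    request = build_sparql_request(endpoint, COUNT_QUERY)
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return int(results["results"]["bindings"][0]["triples"]["value"])
```

Running `count_triples` against both the local copy and the public endpoint gives a cheap first indication of whether the two hold the same data.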

I also tested the provided SPARQL endpoint through Fuseki's web UI, which gave equivalent results.

It was a bit confusing to move between DataHub, DataGraft and the various landing pages for the queries, as not much description was shown on each of these pages. This presents a barrier to entry for new users of the dataset.

Reusability

Reusability: 2: very good (was: 1: good)

Although the paper uses DataHub at https://datahub.io/dataset/norwegiansoe the actual data is not at DataHub, but linked to the authors' https://datagraft.net/ service -- while this is available now, we don't know its future beyond the prodatamarket H2020 project.

Some of the documentation linked to from datahub.io is hosted from someone's Dropbox account, which sounds very fragile and not very persistent.

Therefore I would ask for versioned RDF data dumps to also be archived on a third-party site - e.g. Zenodo - and this should include versioning, provenance and usage documentation.

The authors have been diligent in providing example queries and queries used during construction of the dataset. This should be commended. I tested the queries in my own Jena instance which gave equivalent results.

The listing of queries is in a PDF https://www.dropbox.com/s/18pho0bpybwa721/SPARQLQueriesForLinkedDataGeneartion.pdf?dl=0 which has links back to the datagraft.net service - so again, if this service is gone we no longer know the queries. For longevity I would be happier if these example queries were provided as files inside a GitHub repository or Zenodo archive.

I was unable to view any of the data transformations at datagraft - e.g. https://datagraft.io/prodatamarket_publisher/transformations/the-cadastral-parcel-ownership-transformation - in Chrome I got a blank frame, while Firefox gives a "You are being redirected" link to the sign-in page. After I registered and signed in to Datagraft I was able to view them. Can these be made fully public for inspection?

The license is not specified, while the paper claims this is "Norwegian License for Open Government Data" (why a custom license?) - this is however not expressed on https://datahub.io/dataset/norwegiansoe which lacks even minimal metadata: "License Other (Open)", "Created: unknown". On datagraft.io I get "No license specified". I would require the license information in both the landing pages and in the download.

The downloaded dataset does not embed VOID or provenance information, so we don't formally know who made it, when or how. The FAIR principles are therefore not properly followed here.
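This kind of check can be automated over an N-Triples dump. The sketch below is a rough line-based scan rather than a full N-Triples parser, and the choice of `dcterms:license` and the PROV namespace as indicators is my assumption about what a well-described dump would contain.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID_DATASET = "<http://rdfs.org/ns/void#Dataset>"
DCT_LICENSE = "http://purl.org/dc/terms/license"
PROV_NS = "http://www.w3.org/ns/prov#"

# Subject (IRI or blank node), predicate IRI, then the object up to the final dot.
TRIPLE = re.compile(r"^\s*(\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$")


def metadata_report(lines):
    """Report whether a dump contains VoID, licence or PROV statements."""
    report = {"void_dataset": False, "license": False, "provenance": False}
    for line in lines:
        match = TRIPLE.match(line)
        if not match:
            continue  # blank lines, comments, or anything this sketch cannot read
        _, predicate, obj = match.groups()
        if predicate == RDF_TYPE and obj == VOID_DATASET:
            report["void_dataset"] = True
        if predicate == DCT_LICENSE:
            report["license"] = True
        if predicate.startswith(PROV_NS) or obj.startswith("<" + PROV_NS):
            report["provenance"] = True
    return report
```

Passing an open file handle of the downloaded dump to `metadata_report` yields three booleans that map directly onto the missing items listed above.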

Edit: Thanks for the improvements, including a VOID description and putting the RDF data dump and documentation at https://doi.org/10.5281/zenodo.818088 (links to dump from datahub should use the DOI, also add this DOI as citation in the updated paper).

I wonder why the OWL ontology and SPARQL queries cannot also be preserved as files in Zenodo, rather than in a PDF that links to https://datagraft.io/ pages - will this web service remain available in perpetuity?

I have changed the Reusability score from "1: good" to "2: very good".

Resource design quality

Resource Design Quality: 1: good

The dataset links to DBpedia, Geonames and Lenke.no, and reuses top-level ontologies like DUL as well as schema.org and DBpedia.

The URIs in the dataset, e.g. http://www.datagraft.net/prodm/RealRights/021437200969159570 are not resolvable. (404 Not Found) I would expect these to at least redirect to a Datagraft landing page for the corresponding dataset, and ideally work with content negotiation to the RDF statements.
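A dereferencing check of this kind can be scripted. The following is a minimal sketch using only the Python standard library; the list of RDF media types and the three outcome labels are my own illustrative choices.

```python
import urllib.error
import urllib.request

RDF_TYPES = ("text/turtle", "application/n-triples", "application/rdf+xml",
             "application/ld+json")


def classify(status, content_type):
    """Summarise one HTTP response for a Linked Data URI."""
    media_type = content_type.split(";")[0].strip().lower()
    if status >= 400:
        return "not resolvable"
    if media_type in RDF_TYPES:
        return "resolves to RDF"
    return "resolves, but not to RDF"


def dereference(uri, accept="text/turtle"):
    """GET `uri` (following redirects) and classify the final response."""
    request = urllib.request.Request(uri, headers={"Accept": accept})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            content_type = response.headers.get("Content-Type", "")
            return classify(response.status, content_type)
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the error object still carries the status.
        return classify(error.code, "")
```

A 404 such as the one observed for the example URI above falls into the "not resolvable" branch, while a redirect to an HTML landing page would be reported as "resolves, but not to RDF".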

The vocabularies are well presented, CC-BY licensed and documented at http://vocabs.datagraft.net/ - I am not so sure why the vocabulary is so split up, some of these have only two classes - however this paper submits the dataset rather than the vocabulary so that is out of scope for this review.

The vocabulary namespaces include the version number "0.1" - this sounds fragile as it means the dataset and queries would need to be updated for any changes to the vocabulary. I would expect only the "major" version number to be included in the namespace URL.

http://eubusinessgraph.eu/vocabs/ is 404 Not Found - is this a third-party vocabulary?

Again vocabs.datagraft.net is a self-hosted service, I would prefer if these vocabularies were also archived in a third-party repository for longevity.

I have highlighted that dataset metadata and license information is missing, as well as there not being a downloadable datadump archive from a durable third-party repository. This makes it hard for other researchers to rely on the dataset. If this is fixed I would change my overall evaluation from "weak accept" to "strong accept".

Edit: The URI 404 has not been addressed, and the custom vocabulary is evaluated as fragile by the other reviewers. I'm OK with it, but would have hoped for some persistence here; so I'll leave Resource Design Quality as "1: good".

Overall paper evaluation

Overall paper evaluation: 3: strong accept (was: 1: weak accept)

Detailed comments to the authors

Hi, I am Stian Soiland-Reyes http://orcid.org/0000-0001-9842-9718 and believe in open reviews.

I would appreciate if you could contact [email protected] if you agree on me publishing this review.

This review is licensed under a Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/ and is also available at the -secret- URL https://gist.github.com/stain/ce1f8884986d1109429f4cd0ed22c2d4

Edit: The authors agreed to publish this review.

This paper presents a dataset with detailed information about the Norwegian government's owned land and building mass.

The dataset uses a comprehensive vocabulary, and is provided with SPARQL endpoints and example queries.

The paper describes in good detail the motivation for creating the Linked Data representation, as well as how it was generated and post-processed.

The section on Related Work is sufficient - but I would have expected more wide-ranging coverage - for instance the UK Land Registry has a SPARQL endpoint at http://landregistry.data.gov.uk/app/qonsole - however this covers house prices in land transactions and not cadastral data on state-owned properties.

The submitted dataset is quite extensive, well-produced and, as the paper shows, can be useful in multiple scenarios like a vulnerability analysis from natural hazards. While I understand this to be a hand-constructed example for the paper only, and the dataset is not currently used in "production", I would be happy to accept this paper, and look forward to hearing about future uses and integrations with this dataset.

Confidential remarks for the program committee

Reviewer 4 (Stian Soiland-Reyes)

I disagree with reviewer #1 on some of the concerns over the dataset using an ontology that has not been peer-reviewed or has author overlap - I don't think that should be a criterion, as that would exclude many other good datasets like DBpedia.

Beyond this paper, the ontology has been made publicly available, and also announced:

https://blog.prodatamarket.eu/tag/prodatamarket-ontology/

I found the SWJ submission at: http://www.semantic-web-journal.net/content/prodatamarket-ontology-enabling-semantic-interoperability-real-property-data

Perhaps the ISWC paper should cite the above with "under review", that would help to explain some of the questions I had.

This submission should be about the submitted resource (the dataset), not the design or review of the ontology used. However the choice of ontologies should be under scrutiny. I raised an issue with the vocabulary having "0.1" in the namespace - at least that means that any newer version of the ontology after SWJ review will not "break" this dataset.

However I do see some of #1's point - given that the ontology and dataset here go "hand in hand" and are not used elsewhere; coupled with the sparsity of links that go outside the dataset, this means it does not fit that easily into the Linked Open Data cloud.

Reviewer 4 (Stian Soiland-Reyes)

I increased my overall vote to +3 for the resource submitted - a dataset and service which I now find well-deployed, well-designed (although with a custom ontology) and useful.

As the other reviewers have not pointed out any competing ontologies for reporting a State of Estate, or other vocabularies which should have been used instead, I disagree with their carte blanche dismissal of this submission on the basis of the use of an "unpublished" vocabulary -- I don't think those who are primarily dataset providers should be required to do a waterfall-model peer-review submission of first the vocabulary and only subsequently the dataset - this would easily give a chicken/egg situation.

It would however be a good overall requirement that the submitted SWJ vocabulary paper is made available on a preprint server, with a "Submitted to SWJ" citation/link in this resource paper.

Rebuttal

(Withheld. tl;dr: metadata/license/void info added, download dump added to Zenodo as https://doi.org/10.5281/zenodo.818088; vocabulary paper under review in SWJ will be cited in final version)

Response to authors

Edit: The authors have addressed most of the metadata/license/download issues. Other reviewers comment on fragility with the custom vocabulary. I think this can be addressed by updating the paper to have an explicit citation to the vocabulary paper submission (with public preprint), along with an explicit citation to the data dump and a new "vocabulary dump" on Zenodo.

Thanks to the authors for agreeing to my open peer review - I have published my part at https://gist.github.com/stain/ce1f8884986d1109429f4cd0ed22c2d4

I change my overall vote from "1: weak accept" to "3: strong accept"; on condition of those citation requirements for the revised paper.
