Records, Documents, & Graphs

Accounting for record scope & mutability in metadata management.

Smoothies cannot be edited @anarchivist -- 6:52 PM PDT - 23 Apr 2015

Questions

The key question I'm setting out to answer is: How can we account for routine change and updates in our metadata records? An initial attempt to derive a model for change from current practice has led to some corollary questions about the relationship between Records, Documents, Description Sets, Application Profiles, Resources, and RDF Sources^{lit review}:

What is a Record?
- Possible definition: Records are Documents instantiating a Description Set
- Are Records mutable? What about Description Sets?
- What other views of metadata records need to be taken into account?
Is a Description Set in DCAM equivalent to a (or a kind of) Graph in RDF?^{property value}
What is the relationship between a Description Set (and by extension, a Record) and a Resource?
- If we speak of a "record for Moby Dick", how do we distinguish that from a "record for Melville" that happens to contain some statements about Moby Dick? Is this a valid distinction under DCAM?
Is a Description Set an example of an RDF Source?

Records and Documents

The Dublin Core Abstract Model (DCAM) defines a metadata record as a document that instantiates a Description Set^record. Description Sets, in turn, are defined as sets of (one or more) Descriptions, with each Description defined as a set of one or more Statements "about one, and only one, resource".

Functional Requirements for Bibliographic Records (FRBR) treats "Records" as an aggregation of "descriptive elements" and "filing devices" (IFLA, 1997; see especially Sec. 2.2). It's not clear from the definition given (or a loose reading of the remainder of the document) whether the IFLA Study Group's view is of a record as an abstract entity that can be updated, as a static representation of data, or as a literal physical document. While some combination appears to be at play, there seems to be an emphasis on the last.

Record mutability under both understandings raises the problems documented in Documents Cannot Be Edited (Renear & Wickett, 2009). There is no model for revision in place, and Description Sets lack identifiers of their own to pin revisions on; Records often lack them as well. Even taking a casual view of Records as physical documents, there seems to be little option but to view "revisions" as new documents which will be filed in roughly equivalent places to their predecessors in a card catalog or similar.

In the case of DCAM, the problem is compounded, since Description Sets are defined as sets (sets of sets of statements). This keeps the model close to that of RDF, but leaves the idea of a persistent, changeable Record out of the picture.

Mutability as a Requirement for Actionable Records

The view of Records implied by the above leaves us with significant problems for even basic metadata and asset management workflows. Our practice when describing a resource is to assume that new (and deleted) assertions update an old description. Our systems manage this with internal representations of state, held in database rows, object representations, or otherwise, but usually without an articulated formal model. This won't do when we introduce Linked Data (or any large-scale interoperability scheme). A shared model for mutability is needed.

[I would like to further document/articulate the nature of this requirement! What would we be lacking if we always saw records as static?]

Reviewing the RDF Model

Resources
Statements
Graphs
Datasets
RDF Source

Graphs are Immutable

Graphs are sets of statements.
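
A minimal sketch of what this implies, assuming triples are modeled as plain Python tuples and a Graph as a frozenset (the names Triple, Graph, and with_statement are illustrative and not drawn from any RDF library): "adding" a statement never changes a Graph, it yields a different Graph.

```python
from typing import FrozenSet, Tuple

# Hypothetical modeling choice: a triple is (subject, predicate, object).
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def with_statement(graph: Graph, triple: Triple) -> Graph:
    """Return a new Graph that also contains `triple`; `graph` is untouched."""
    return graph | {triple}


moby = "http://example.org/moby_dick"  # example IRI, assumed for illustration
g1: Graph = frozenset({(moby, "dc:title", "Moby Dick")})
g2 = with_statement(g1, (moby, "dc:creator", "http://example.org/melville"))

# g1 keeps its single statement; g2 is a distinct set with two statements.
print(len(g1), len(g2), g1 == g2)
```

Because a set's identity is its membership, there is nothing left over to call "the same Graph, revised".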

RDF Source

We informally use the term RDF source to refer to a persistent yet mutable source or container of RDF graphs. An RDF source is a resource that may be said to have a state that can change over time. A snapshot of the state can be expressed as an RDF graph. For example, any web document that has an RDF-bearing representation may be considered an RDF source. Like all resources, RDF sources may be named with IRIs and therefore described in other RDF graphs.

RDF and Change over Time. RDF Concepts and Abstract Syntax.

As Resources, Sources can be denoted by an IRI or existentially quantified as a blank node. Further, a Source may be said to relate a time sequence of zero or more RDF graphs, with each graph representing a state of the mutable Resource at a given time.
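
The "time sequence of zero or more graphs" reading can be sketched roughly as follows. This is an illustration under stated assumptions, not part of the RDF specifications: SourceHistory, record, and state_at are hypothetical names, times are plain integers, and Graphs are frozensets of tuples.

```python
from bisect import bisect_right
from typing import FrozenSet, List, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


class SourceHistory:
    """Hypothetical record of a Source's states as (time, Graph) pairs."""

    def __init__(self) -> None:
        self._times: List[int] = []
        self._graphs: List[Graph] = []

    def record(self, time: int, graph: Graph) -> None:
        # Assumption: snapshots arrive in strictly increasing time order.
        if self._times and time <= self._times[-1]:
            raise ValueError("snapshots must be recorded in time order")
        self._times.append(time)
        self._graphs.append(graph)

    def state_at(self, time: int) -> Optional[Graph]:
        """Most recent recorded Graph at or before `time`, or None."""
        index = bisect_right(self._times, time)
        return self._graphs[index - 1] if index else None


record_iri = "http://example.org/record/1"  # example IRI, assumed
first = frozenset({(record_iri, "dc:title", "Moby Dick")})
second = frozenset({(record_iri, "dc:title", "Moby-Dick; or, The Whale")})

history = SourceHistory()
history.record(10, first)
history.record(20, second)
```

Here the Source is the thing that persists; each Graph in the sequence stays fixed, and an empty history (zero graphs) is still a valid Source.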

Revisiting DCAM

A Description is a set of Statements that follow the one-to-one principle over the set. In explicit RDF terms, that is a Graph whose triples share a single Resource as their subject node. On its face, this is very similar to the kind of "resource view" common on Linked Data publishing platforms that expose the triples "about" a given Resource. In practice, a Description adds notions of constraint and completeness either through Description Templates (and Statement Templates) in a Description Set Profile or through less formal guidelines for vocabulary usage commonly included in Application Profiles.

The larger Description Set and its associated Record instantiations are, similarly, Graphs without the subject restriction. Any Graph can arguably be interpreted as a Description Set containing Descriptions for each of the Resources that appear as subjects in its triples, though there may be value in the view that a Graph is only a Description Set when viewed in the context of some set of constraints, or as a candidate expression of a "Profile" or "Shape"^infinite.
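
The reading of any Graph as a Description Set can be sketched by grouping triples on their subject. This is a rough illustration only: the descriptions function and the ex:, dc:, and foaf: shorthand strings are assumptions made for the example, not DCAM or RDF library constructs.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]


def descriptions(graph: Iterable[Triple]) -> Dict[str, FrozenSet[Triple]]:
    """Group triples by subject: one Description per described Resource."""
    grouped = defaultdict(set)
    for triple in graph:
        grouped[triple[0]].add(triple)
    return {subject: frozenset(triples) for subject, triples in grouped.items()}


graph = {
    ("ex:moby_dick", "dc:title", "Moby Dick"),
    ("ex:moby_dick", "dc:creator", "ex:melville"),
    ("ex:melville", "foaf:name", "Herman Melville"),
}
description_set = descriptions(graph)

for subject, description in sorted(description_set.items()):
    print(subject, len(description))
```

Nothing in the result marks this as a "record for Moby Dick" rather than a "record for Melville"; both Resources simply appear as described subjects.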

Some Gaps

While a Record is said to instantiate a single Description Set, DCAM provides no mechanism for determining which Description Set is instantiated.
- This points to an interpretation of Description Set as equivalent to Graph; both are defined as sets of statements, without the trappings that come with being a representation of a given Resource.
- If this is the case then a Record instantiates a given Description Set merely by faithfully encoding the statements that make it up. This leaves no support for notions like "each metadata record is to represent exactly one book" as found in Sec. 6 of Guidelines for Dublin Core Application Profiles (Coyle & Baker, 2008).
Constraints and completeness are similarly problematic, since a single Record may be valid and complete for one profile, but not another.
...
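
The profile-relativity of completeness can be illustrated with a deliberately reduced sketch. It assumes a "profile" is nothing more than a set of required properties, which is far simpler than a real Description Set Profile; is_complete and both profile names are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]


def is_complete(description: FrozenSet[Triple], required: Set[str]) -> bool:
    """True if every required property is used at least once."""
    used = {predicate for _, predicate, _ in description}
    return required <= used


book = frozenset({
    ("ex:moby_dick", "dc:title", "Moby Dick"),
    ("ex:moby_dick", "dc:creator", "ex:melville"),
})

# Two hypothetical profiles, reduced to sets of required properties.
minimal_profile = {"dc:title"}
catalog_profile = {"dc:title", "dc:creator", "dc:date"}

complete_for_minimal = is_complete(book, minimal_profile)
complete_for_catalog = is_complete(book, catalog_profile)
print(complete_for_minimal, complete_for_catalog)
```

The same set of statements passes one check and fails the other, so "complete" cannot be a property of the statements alone.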

RDF Source

The RDF Source concept offers a potential solution for each of these problems.

Towards a Formalized Model for RDF Sources

While a common pattern (alluded to in RDF and Change Over Time) is to dereference the Source's IRI to get the current state of the Resource, it's not explicitly required that the representation express the current state. Nor is it required that each graph in the sequence be retained, or that continuity be maintained.

Linked Data Platform codifies more specific patterns of dereferenceability, including a requirement of fullness of the representation, and methods for updating "current persistent state". I've done some work to formalize similar handling of locally managed state-bearing Graphs in ActiveTriples in a comment on the GitHub issue "Resource-centric vs graph-centric in persistence/querying".

Removing the implementation specific language and restrictions:

An RDF Source is "a resource that may be said to have a state that can change over time". Therefore, it:
- is a Resource
- may be the referent of a URI.
An RDF Source has a Graph container.
- A container is a mechanism for retrieving specific Graphs; a container may be, e.g.
  - a dereferenceable URI (web address); or
  - a named graph; or
  - a language construct (an Object, or a Variable); or
  - a document; or
  - a memory block; etc...
- The Graph in the container represents the Source's current state.
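
The outline above can be sketched as follows. This is an illustration under assumptions rather than the LDP or ActiveTriples design: InMemoryContainer stands in for the "language construct" kind of container, and its replace method and the update method on RDFSource are hypothetical additions, since the outline only says a container retrieves Graphs.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


class InMemoryContainer:
    """Hypothetical container: a language construct holding one Graph."""

    def __init__(self, graph: Graph = frozenset()) -> None:
        self._graph = graph

    def retrieve(self) -> Graph:
        return self._graph

    def replace(self, graph: Graph) -> None:
        # Assumption: swapping the held Graph is how state changes.
        self._graph = graph


class RDFSource:
    """A Resource with changeable state, optionally the referent of a URI."""

    def __init__(self, container: InMemoryContainer,
                 uri: Optional[str] = None) -> None:
        self.uri = uri
        self.container = container

    @property
    def state(self) -> Graph:
        """The Graph in the container represents the current state."""
        return self.container.retrieve()

    def update(self, add: Iterable[Triple] = (),
               remove: Iterable[Triple] = ()) -> Graph:
        """Build a new Graph from the old one and make it the current state."""
        new_state = (self.state - frozenset(remove)) | frozenset(add)
        self.container.replace(new_state)
        return new_state


title = ("ex:moby_dick", "dc:title", "Moby Dick")
source = RDFSource(InMemoryContainer(), uri="http://example.org/record/1")
before = source.state
after = source.update(add=[title])
```

The Source changes while `before` and `after` remain distinct, fixed Graphs; mutability lives in the container, not in any Graph.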

Problems for Provenance

Notes

[lit-review]: Literature review is still ongoing, but I believe I've pulled in the relevant concepts. Some fashion of definition of each concept listed is attempted somewhere in the main text.

[implementations]: While in LDP and ActiveTriples, the current state is represented by a specific Graph, in principle it's only necessary that some snapshots of state may be represented by Graphs.

[property value]: While working through this question, it has occurred to me that JSON-LD represents another example of this issue. Its graphs are expressed in documents as property value pairs in a model very similar to DCAM.

[record]: Specifically, it says a record is "An instantiation of a description set, created according to one of the DCMI encoding guidelines (for example, XHTML meta tags, XML and RDF/XML)." The tie to an encoding is significant, since it ensures that a record expresses at most one Graph.

[infinite]: Consider, for example, the Graph of the web. It's not clear what use there is in viewing this as a Set with a functionally infinite number of Descriptions, or why anyone would want to instantiate such a thing as a Record.

Bibliography

Libraries, Languages of Description, and Linked Data. Baker. 2011.
Establishing Trust in Data Integration Projects. Origins. 2015.
Documents Cannot Be Edited. Renear & Wickett. 2009.
Description Set Profiles. DCMI, Nilsson. 2009.
Dublin Core Abstract Model. DCMI. 2007.
Formalizing Dublin Core Application Profiles in Metadata & Semantics. Nilsson. 2009.
Guidelines for Dublin Core Application Profiles. Coyle & Baker. 2008.
Functional Requirements for Bibliographic Records. IFLA. 1997 (amendments through 2009).

You've asked elsewhere why I focus on DSP rather than DCAM -- it's mainly because "abstract" just doesn't interest me. But here's what I have to say about DCAM:

terminology: yikes!
a) if you just change "surrogate" to "type" or "representation" or something similar it suddenly becomes much more readable
b) syntax encoding scheme is a data type. Call it a data type.
the vocabulary encoding scheme is, as far as I can tell, a validation issue, not an abstract model issue, and should be part of the DSP, not the DCAM. You basically need to have either a single URI as a value, a set of URIs as a value, or the ability to validate against a URI pattern. I have included in the requirements for Shapes the ability to indicate a URI pattern against which values can be compared. So you could have "http://id.loc.gov/" or "http://id.loc.gov/names/" etc. And the arrows from "non-literal value -> member of -> vocabulary encoding scheme" unnecessarily complicate the diagram and I don't see why you would need to know that a non-literal value is a member of a vocabulary encoding scheme as part of your abstract model.
Another argument against the vocabulary encoding scheme here, and for putting it in the DSP, is a lack of a parallel for literal values. You may want to have a set of string values that you validate against ("red, blue, green"). You could even want to apply some regex-type validation to those (e.g. word stemming). Oddly, multiple value strings are not allowed. Even more oddly, multiple language strings are allowed for each plain value string, which, AFAIK, is an error.

If you remove these "oddities" you basically get the elements of RDF, minus the concept of classes. (Interestingly missing from the DCAM.) DCAM adds the record structure, but that could be considered the intro to the DSP. So my preference is to assume RDF/S, and apply that to the DSP to define a record. I just don't see a whole lot of value in the DCAM as it is today, other than as an introduction to the DSP.

no-reply commented May 5, 2015

no-reply commented May 5, 2015

kcoyle commented May 11, 2015

no-reply commented May 11, 2015

mjsuhonos commented Jun 14, 2015