
@no-reply
Last active March 23, 2017 20:09
Records, Documents, & Graphs: Accounting for record scope & mutability in metadata management

Records, Documents, & Graphs

Accounting for record scope & mutability in metadata management.

Smoothies cannot be edited @anarchivist -- 6:52 PM PDT - 23 Apr 2015

Questions

The key question I'm setting out to answer is: How can we account for routine change and updates in our metadata records? An initial attempt to derive a model for change from current practice has led to some corollary questions about the relationship between Records, Documents, Description Sets, Application Profiles, Resources, and RDF Sources [lit-review]:

  • What is a Record?
    • Possible definition: Records are Documents instantiating a Description Set
    • Are Records mutable? What about Description Sets?
    • What other views of metadata records need to be taken into account?
  • Is a Description Set in DCAM equivalent to a (or a kind of) Graph in RDF? [property value]
  • What is the relationship between a Description Set (and by extension, a Record) and a Resource?
    • If we speak of a "record for Moby Dick", how do we distinguish that from a "record for Melville" that happens to contain some statements about Moby Dick? Is this a valid distinction under DCAM?
  • Is a Description Set an example of an RDF Source?

Records and Documents

The Dublin Core Abstract Model (DCAM) defines a metadata record as a document that instantiates a Description Set [record]. Description Sets, in turn, are defined as sets of (one or more) Descriptions, with each Description defined as a set of one or more Statements "about one, and only one, resource".

Functional Requirements for Bibliographic Records (FRBR) treats "Records" as an aggregation of "descriptive elements" and "filing devices" (IFLA, 1997; see especially Sec. 2.2). It's not clear from the definition given (or a loose reading of the remainder of the document) whether the IFLA Study Group's view is of a record as an abstract entity that can be updated, as a static representation of data, or as a literal physical document. While some combination appears to be at play, there seems to be an emphasis on the last.

Record mutability under either understanding raises the issues documented in Documents Cannot be Edited (Renear & Wickett, 2009). There is no model for revision in place, and Description Sets lack identifiers of their own to pin revisions on. Records often lack such identifiers as well. Even taking a casual view of Records as physical documents, there seems to be little option but to view "revisions" as new documents which will be filed in roughly equivalent places to their predecessors in a card catalog or similar.

In the case of DCAM, the problem is compounded, since Description Sets are defined as sets (sets of sets of statements). This keeps the model close to that of RDF, but leaves the idea of a persistent, changeable Record out of the picture.

Mutability as a Requirement for Actionable Records

The view of Records implied by the above leaves us with significant problems for even basic metadata and asset management workflows. Our practice when describing a resource is to assume that new (and deleted) assertions update an old description. Our systems manage this with internal representations of state, controlled with database rows, or object representations, or otherwise; but usually without an articulated formal model. This won't do when we introduce Linked Data (or any large scale interoperability scheme). A shared model for mutability is needed.

[I would like to further document/articulate the nature of this requirement! What would we be lacking if we always saw records as static?]

Reviewing the RDF Model

  • Resources
  • Statements
  • Graphs
  • Datasets
  • RDF Source

Graphs are Immutable

Graphs are sets of statements.
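
A minimal sketch of what this means in practice, assuming we model a statement as a plain (subject, predicate, object) tuple and a Graph as a Python frozenset. The `http://example.org/` IRIs are made up for illustration and the representation is mine, not something RDF prescribes:

```python
# Assumption: a statement is a (subject, predicate, object) tuple of strings,
# and a graph is a frozenset of such tuples. All IRIs here are hypothetical.
EX = "http://example.org/"

g1 = frozenset({
    (EX + "moby_dick", EX + "title", "Moby Dick"),
    (EX + "moby_dick", EX + "creator", EX + "melville"),
})

# "Adding" a statement does not change g1; it produces a different graph.
g2 = g1 | {(EX + "melville", EX + "name", "Herman Melville")}

# Set semantics: the same statements make the same graph, whatever the order.
g1_again = frozenset(reversed(sorted(g1)))
```

Under this reading there is no operation that edits `g1` in place; anything we would call an "update" is really a move from one graph to another.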

RDF Source

We informally use the term RDF source to refer to a persistent yet mutable source or container of RDF graphs. An RDF source is a resource that may be said to have a state that can change over time. A snapshot of the state can be expressed as an RDF graph. For example, any web document that has an RDF-bearing representation may be considered an RDF source. Like all resources, RDF sources may be named with IRIs and therefore described in other RDF graphs.

As Resources, Sources can be denoted by an IRI or existentially quantified as a blank node. Further, a Source may be said to relate a time sequence of zero or more RDF graphs, with each graph representing a state of the mutable Resource at a given time.
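
One way to picture the "time sequence of graphs" reading is sketched below. The `RDFSource` class, its method names, and the integer timestamps are all hypothetical choices for illustration; nothing in RDF 1.1 defines such an interface:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass
class RDFSource:
    """Hypothetical model of a Source: an optional IRI plus timed snapshots."""
    iri: Optional[str] = None  # None stands in for a blank node
    history: List[Tuple[int, Graph]] = field(default_factory=list)

    def record_state(self, time: int, graph: Graph) -> None:
        """Store an immutable snapshot of the Source's state at `time`."""
        self.history.append((time, frozenset(graph)))

    def state_at(self, time: int) -> Optional[Graph]:
        """Return the latest snapshot recorded at or before `time`, if any."""
        candidates = [(t, g) for t, g in self.history if t <= time]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


# Made-up example data.
EX = "http://example.org/"
v1 = frozenset({(EX + "moby_dick", EX + "title", "Moby Dick")})
v2 = v1 | {(EX + "moby_dick", EX + "creator", EX + "melville")}

source = RDFSource(iri=EX + "records/1")
source.record_state(1, v1)
source.record_state(5, v2)
```

Here `source.state_at(3)` gives back `v1` and `source.state_at(0)` gives `None`, since a Source may relate zero graphs at a given point. The Source itself keeps its identity while the graphs it relates differ.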

Revisiting DCAM

A description is a set of statements that follow the one-to-one principle over the set. In explicit RDF terms, that is, a Graph whose triples share a single Resource as their subject node. On its face, this is very similar to the kind of "resource view" common on Linked Data publishing platforms that expose the triples "about" a given Resource. In practice, a description adds notions of constraint and completeness either through Description Templates (and Statement Templates) in a Description Set Profile or through less formal guidelines for vocabulary usage commonly included in Application Profiles.

The larger Description Set and its associated Record instantiations are, similarly, Graphs without the subject restriction. Any Graph can arguably be interpreted as a Description Set containing Descriptions for each of the Resources that appear as subjects in its triples; though there may be value in the view that a Graph is only a Description Set when viewed in the context of some set of constraints, or as a candidate expression of a "Profile" or "Shape" [infinite].
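
The reading of a Graph as a Description Set can be sketched as a partition by subject. This is only an illustration under the tuple-and-frozenset assumption; the function names and the example IRIs are hypothetical:

```python
from collections import defaultdict


def descriptions(graph):
    """Split a graph into one description (a subgraph) per subject."""
    by_subject = defaultdict(set)
    for s, p, o in graph:
        by_subject[s].add((s, p, o))
    return {s: frozenset(triples) for s, triples in by_subject.items()}


def is_description(graph):
    """True when every statement in the graph shares a single subject."""
    return len({s for s, _, _ in graph}) == 1


# Made-up "record" containing statements about a book and its author.
EX = "http://example.org/"
record_graph = frozenset({
    (EX + "moby_dick", EX + "title", "Moby Dick"),
    (EX + "moby_dick", EX + "creator", EX + "melville"),
    (EX + "melville", EX + "name", "Herman Melville"),
})

description_set = descriptions(record_graph)
```

Note that nothing in `record_graph` says whether this is a "record for Moby Dick" or a "record for Melville"; both subjects simply get a description. That information would have to come from outside the Graph.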

Some Gaps

  • While a Record is said to instantiate a single Description Set, DCAM provides no mechanism for determining which Description Set is instantiated.
    • This points to an interpretation of Description Set as equivalent to Graph---both are defined as sets of statements, without the trappings that come with being a representation of a given Resource.
    • If this is the case then a Record instantiates a given Description Set merely by faithfully encoding the statements that make it up. This leaves no support for notions like "each metadata record is to represent exactly one book" as found in Sec. 6 of Guidelines for Dublin Core Application Profiles (Coyle & Baker, 2008).
  • Constraints and completeness are similarly problematic, since a single Record may be valid and complete for one profile, but not another.
  • ...

RDF Source

The RDF Source concept offers a potential solution for each of these problems.

Towards a Formalized Model for RDF Sources

While a common pattern (alluded to in RDF and Change Over Time) is to dereference the Source's IRI to get the current state of the Resource, it's not explicitly required that the representation express the current state. Nor is it required that each graph in the sequence be retained, or that continuity be maintained.

Linked Data Platform codifies more specific patterns of dereferenceability, including a requirement of fullness of the representation, and methods for updating "current persistent state". I've done some work to formalize similar handling of locally managed state-bearing Graphs in ActiveTriples in a comment on the GitHub issue "Resource-centric vs graph-centric in persistence/querying".

Removing the implementation specific language and restrictions:

  • An RDF Source is "a resource that may be said to have a state that can change over time". Therefore, it:
    • is a Resource
    • may be the referent of a URI.
  • An RDF Source has a Graph container.
    • A container is a mechanism for retrieving specific Graphs; a container may be, e.g.
      • a dereferenceable URI (web address); or
      • a named graph; or
      • a language construct (an Object, or a Variable); or
      • a document; or
      • a memory block; etc...
    • The Graph in the container represents the Source's current state.
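
The bullets above can be sketched as follows, assuming a "container" is nothing more than a callable that returns the current Graph. The `Source` class and the named-graph dictionary are hypothetical stand-ins, not the ActiveTriples or LDP interfaces:

```python
from typing import Callable, FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


class Source:
    """Hypothetical RDF Source: a Resource with a Graph container."""

    def __init__(self, container: Callable[[], Graph], iri: Optional[str] = None):
        self.iri = iri               # the Source may be the referent of a URI
        self._container = container  # mechanism for retrieving a specific Graph

    @property
    def current_state(self) -> Graph:
        """The Graph in the container represents the Source's current state."""
        return frozenset(self._container())


# One possible container: a lookup into a table of named graphs (made-up IRIs).
EX = "http://example.org/"
named_graphs = {
    EX + "graphs/1": frozenset({(EX + "moby_dick", EX + "title", "Moby Dick")}),
}
source = Source(lambda: named_graphs[EX + "graphs/1"], iri=EX + "records/1")

before = source.current_state
# Updating the Source means putting a different Graph in the container.
named_graphs[EX + "graphs/1"] = before | {
    (EX + "moby_dick", EX + "creator", EX + "melville"),
}
after = source.current_state
```

The same `Source` could be given a container that reads a document or dereferences a URI instead; the Graphs stay immutable while the thing we call "the record" changes.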

Problems for Provenance


Notes

[lit-review]: Literature review is still on-going, but I believe I've pulled in the relevant concepts. Some fashion of definition of each concept listed is attempted somewhere in the main text.

[implementations]: While in LDP and ActiveTriples, the current state is represented by a specific Graph, in principle it's only necessary that some snapshots of state may be represented by Graphs.

[property value]: While working through this question, it has occurred to me that JSON-LD represents another example of this issue. Its graphs are expressed in documents as property value pairs in a model very similar to DCAM.

[record]: Specifically, it says a record is "An instantiation of a description set, created according to one of the DCMI encoding guidelines (for example, XHTML meta tags, XML and RDF/XML)." The tie to an encoding is significant, since it ensures that a record expresses at most one Graph.

[infinite]: Consider, for example, the Graph of the web. It's not clear what use there is in viewing this as a Set with a functionally infinite number of Descriptions or why anyone would want to instantiate such a thing as a Record.


Bibliography

@kcoyle

kcoyle commented May 2, 2015

I never read the intention of the Semantic Web to be nothing but a giant graph of contextless triples. Sir Tim's original proposal put data within the context of documents. The "control" (if any exists) is at the document level. Documents provide the "record-ness" by encapsulating a particular set of data along with human- and machine- understandable context, including administrative data about the document and data.

The GLAM world, for the most part, has separated metadata/data from the thing it describes. The metadata is the surrogate for the thing. The metadata, however, is also a document that is a representation. Just because we managed to wrap that document in tags and subfields doesn't change that fact. If we treat the metadata as a document, with some embedded data, then we no longer have free-floating data, we have a document that can have versions (updates), and that provides a container for the linkable data. (Note: as we deal more with digital materials, the separation of metadata and resource should no longer apply. Right?)

Thus, it seems to me that HTML, JSON-LD or XML are perfectly good containers, are handy for search and display, can be added to as needed, and can encapsulate both textual content and data that can be triple-ified. The next question is - how do we connect our triples to the LOD cloud while still serving our primary purpose of describing resources?

@no-reply
Author

no-reply commented May 2, 2015

The "control" (if any exists) is at the document level. Documents provide the "record-ness" by encapsulating a particular set of data along with human- and machine- understandable context, including administrative data about the document and data.

+1. For me, this is the basic concept that RDF Source tries to "informally" capture. The common practice is to treat documents available on the web & containing RDF as the current representation of a given resource; usually the one dereferenced to get the representation. You can see this pattern at play in a lot of the early examples (e.g. Eiffel Tower) and RDF publishing projects (individual's FOAF files).

To some extent, this has all been formalized in RDF 1.1. When you get a document from such a source---whether it's a static document you've accessed from a disk, generated with SPARQL CONSTRUCT as a "Linked Data view", or otherwise put together just-in-time---what you're getting is a Graph (i.e. a set of Statements). By convention, that Graph is frequently viewed in a record-y way (I think in DCAM terminology, it's viewed as something like a Description Set, while the specific serialization is viewed as a metadata record).

The problem I'm trying to point to is that though we have a loose understanding of RDF documents in this way, as a client consuming published data, I'm left pretty much in the dark about what it means if I get two different Graphs when requesting a resource repeatedly. When is it safe to understand statements that are no longer present in the second Graph as invalidated? What if I can find the same statement elsewhere on the web, or even via a different request from the same source?
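
To make the client's position concrete, here is a rough sketch, assuming Graphs are frozensets of (subject, predicate, object) tuples and using made-up IRIs. The `diff` helper is hypothetical:

```python
def diff(old, new):
    """Compare two snapshots retrieved from the same source."""
    return {"added": new - old, "removed": old - new}


# Two made-up responses from repeated requests for one resource.
EX = "http://example.org/"
first = frozenset({
    (EX + "moby_dick", EX + "title", "Moby Dick"),
    (EX + "moby_dick", EX + "creator", EX + "melville"),
})
second = frozenset({
    (EX + "moby_dick", EX + "title", "Moby-Dick; or, The Whale"),
    (EX + "moby_dick", EX + "creator", EX + "melville"),
})

changes = diff(first, second)
```

The set arithmetic tells me which statements are in `changes["removed"]`, but not whether the publisher retracted them, moved them to another document, or simply left them out of this response. That meaning has to come from a model the Graphs themselves don't carry.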

And if I'm publishing... what should I understand about the meaning of the presence or absence of a given statement in a particular Graph (or document, etc...); see, e.g., the "Moby Dick" v. "Melville" example re: Description Sets. The models in use are varied, and there's a real lack of clarity about what the options are and how implementation choices affect semantics.

If we treat the metadata as a document, with some embedded data, then we no longer have free-floating data, we have a document that can have versions (updates), and that provides a container for the linkable data.

I guess I'm not convinced that pinning a record state to a Document gets us much. For one thing, it's not any more clear what it means for a document to be updated than for a Graph. How do I know that two similar documents are versions of each other? How do I know which one invalidates the other?

More than that, we work often with stateful Resource representations that we wouldn't want to understand as Documents at all. For example: ActiveTriples tries to encapsulate current descriptions of resources in Graphs that exist abstractly in memory. What patterns can I use to connect those representations with HTTP requests and documents I'll send to users; or with Graphs that I'll persist (equally abstractly) in a triplestore?

I'm digesting what you've said about DSP and Shapes. I have a lot of thoughts about this, and am still trying to organize them even for myself.

@kcoyle

kcoyle commented May 2, 2015

DSP vs. DCAM
The Dublin Core Abstract Model does indeed define descriptions and description sets. I guess I go with the DSP because it is an attempt to be actionable. But you are correct in using DCAM in your description, since that's the basis for DSP. Unfortunately, the DCAM was a total flop, mainly due to the deep obscurity of its terminology. "Non-literal value surrogate" anyone? (And of course the irony of this coming from DC, whose previous work was purposely accessible to normal human beings.) I think if it had been expressed differently it would have given us something to work with. As it is, I'm not sure that the distinctions that are made are always useful, which is why I'd like us to do a version 2 using existing RDF concepts where appropriate. (Acc. to TomB, the group was trying to avoid using any RDF terminology, in part because RDF was less well formed at the time.)

We can still have descriptions, description sets and statements. "Value surrogate" etc. has got to go -- it's a value. Values can be literals or IRIs; literal values can be strings or typed data. That bit of simplification goes a long way to making the DCAM digestible. I started something along these lines here:
http://wiki.dublincore.org/index.php/RDF_Application_Profiles/DSPanalysis

@kcoyle

kcoyle commented May 2, 2015

" I guess I'm not convinced that pinning a record state to a Document gets us much. For one thing, it's not any more clear what it means for a document to be updated than for a Graph. How do I know that two similar documents are versions of each other? How do I know which one invalidates the other? "

When I say "document" I don't mean "graph" in the RDF sense. I really mean "document" in the sense that it is a bounded set. That set can be bounded between <html></html>, <xml></xml> or it can be any other thing with bounds, but there does need to be a way to say "it's what's inside this." That set also will have to have an identifier, and it will have to have version information. I see it as working similar to a wiki, or even github.

The W3C Shapes group is entirely focused on validation of individual graphs, and interestingly versioning has not come up. It seems to me that this is a flaw in their thinking, but I doubt if I'm the one to get it across to them. As to your question of retrieving a graph multiple times and getting different results - they seem to be seeing each interaction as a separate validation act. There's one person in the group who makes a lot of sense -- I think I'll ping him separately on this topic.

@aisaac

aisaac commented May 5, 2015

A colleague of mine has pointed me to this discussion. I have not much to add, no time to think about the theory. But if you need use cases and requirements, the problem of representation/versions had to be tackled for the ongoing EuropeanaCloud project, where they (I was not much involved) created a model for records and datasets. It's not RDF, and there's not much documentation besides an old deliverable http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/Europeana_Cloud/Deliverables/D2.2%20Europeana%20Cloud%20Architectural%20Design.pdf and the current API: http://sonar.eanadev.org/job/oncommit-eCloud/MCS_REST_API/index.html

@no-reply
Author

no-reply commented May 5, 2015

@aisaac: Thanks. I'm reading this.

@no-reply
Author

no-reply commented May 5, 2015

I added a loose distillation of the model I've introduced to ActiveTriples, removing implementation specific restrictions. I expect to put more substantial additions in in the next couple of days, as I regain the energy to work on this.

@kcoyle I think the bullets I added outlining an RDF Source may help clarify where I'm going with this and how it relates to your conception of a "document". Oddly, I think the Shapes group is (probably) right to be focusing on the validation of individual graphs. What RDF Source gets us is a target for concepts like:

  • this set of constraints applies to this Resource (or Class, or Source with these qualities).
  • this Resource was valid at time x and invalid at time y.

as well as "this description set is about a book" and similar.

@kcoyle

kcoyle commented May 11, 2015

You've asked elsewhere why I focus on DSP rather than DCAM -- it's mainly because "abstract" just doesn't interest me. But here's what I have to say about DCAM:

  1. terminology: yikes!
    a) if you just change "surrogate" to "type" or "representation" or something similar it suddenly becomes much more readable
    b) syntax encoding scheme is a data type. Call it a data type.

  2. the vocabulary encoding scheme is, as far as I can tell, a validation issue, not an abstract model issue, and should be part of the DSP, not the DCAM. You basically need to have either a single URI as a value, a set of URIs as a value, or the ability to validate against a URI pattern. I have included in the requirements for Shapes the ability to indicate a URI pattern against which values can be compared. So you could have "http://id.loc.gov/" or "http://id.loc.gov/names/" etc. And the arrows from "non-literal value -> member of -> vocabulary encoding scheme" unnecessarily complicate the diagram and I don't see why you would need to know that a non-literal value is a member of a vocabulary encoding scheme as part of your abstract model.

  3. Another argument against the vocabulary encoding scheme here, and for putting it in the DSP, is a lack of a parallel for literal values. You may want to have a set of string values that you validate against ("red, blue, green"). You could even want to apply some regex-type validation to those (e.g. word stemming). Oddly, multiple value strings are not allowed. Even more oddly, multiple language strings are allowed for each plain value string, which, AFAIK, is an error.
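
A rough sketch of the two validation checks described in points 2 and 3, treating them as constraints applied to values rather than as part of an abstract model. The function names are hypothetical, the identifier in the example is made up, and this is not how DSP or the Shapes work expresses such constraints:

```python
def matches_uri_pattern(value, prefixes):
    """Check a URI value against a list of allowed URI prefixes."""
    return any(value.startswith(prefix) for prefix in prefixes)


def in_allowed_strings(value, allowed):
    """Check a literal value against an enumerated set of strings."""
    return value in allowed


# Prefixes taken from the discussion above; the value is a made-up identifier.
name_prefixes = ["http://id.loc.gov/names/"]
colors = {"red", "blue", "green"}

uri_ok = matches_uri_pattern("http://id.loc.gov/names/example-name", name_prefixes)
uri_bad = matches_uri_pattern("http://example.org/names/example-name", name_prefixes)
literal_ok = in_allowed_strings("blue", colors)
literal_bad = in_allowed_strings("purple", colors)
```

Both checks have the same shape, which is part of the argument: the URI case and the literal case are parallel validation rules, and neither needs a "member of vocabulary encoding scheme" relation in the model.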

If you remove these "oddities" you basically get the elements of RDF, minus the concept of classes. (Interestingly missing from the DCAM.) DCAM adds the record structure, but that could be considered the intro to the DSP. So my preference is to assume RDF/S, and apply that to the DSP to define a record. I just don't see a whole lot of value in the DCAM as it is today, other than as an introduction to the DSP.

@no-reply
Author

Thanks Karen.

I agree totally about encoding schemes and the like. These are overly prescriptive as part of an abstract model, not well adopted (most systems apply these constraints on the property-level, not the value), and arguably not good practice for many uses.

My hope for DCAM in general, and part of my reason for undertaking a close reading of the literature about it, is that it can provide the basis for reading common non-RDF metadata as RDF equivalent. The language and the class diagrams, I agree, are a mess; and a lot of key concepts are left to implicature or are just too vague to be useful. (For instance: though there's much talk about the "described resource" and "property-value pairs", I can't see it stated anywhere that properties are to be understood as properties of the "described resource".)

Your update improves things. I think I would add:

  • A "description set" is equivalent to an RDF "Graph".
  • A "description" is equivalent to an RDF "Graph", with the constraint that all of its Statements must have the same subject (the "described resource").
    • In an RDF context, we can (but don't have to) dispense with this altogether. Still, somehow it feels to me like the main value in DCAM.
  • A DCAM "statement" is equivalent to an RDF "Statement", with an implied subject of the "described resource".
    • I may be misunderstanding you, but I think this fixes the problem of multiple value strings.
    • I'm not entirely clear on whether a 'description' without a "resource URI" maps to a blank node. Formally, bnodes assert a resource while DCAM 'descriptions' seem to want to assert this resource. But this problem extends far beyond DCAM.

I think this still leaves pretty much all of my questions above unanswered, but at least the mapping between the two models is clarified as far as it goes. :)

@mjsuhonos

Hi all -- I've added a gist with some of my own (parallel/related/semi-formed) thoughts related to this document:

https://gist.github.com/mjsuhonos/9d4922cf85627ed909e2

I really do think the terminology is important, and I try to take a stab at aligning some of it (ie. what Karen calls a "document" I basically equate to an "object", plus its direct neighbours, which can be considered an RDF graph).

The main issue I have with treating RDF graphs as immutable, atomic units is when they contain entirely unrelated objects or even indirected (neighbour-of-a-neighbour-of-a-neighbour, etc) objects. Sure, this is valid RDF, but it's really hard to model in an object-document sense, and seems inherently fragile. I don't think we would be likely to see a traditional "record" contain this degree of indirection.

Anyway, I'm going to re-read through this thread a few more times and will try to add any (hopefully marginally useful) thoughts if/as they materialize.
