Slides: http://www.slideshare.net/azaroth42/community-challenges-for-practical-linked-open-data-linked-pasts-keynote This file: http://bit.ly/usable_lod
Comments very welcome below!
Approximately what I said during the keynote.
[1] Good morning! Thank you Leif for the introduction, and for the invitation to come and give the keynote presentation. I'm honestly excited to be here, as my relatively recent move to the Getty from Stanford has given me an amazing opportunity to become more deeply involved in both Linked Open Data, and how to use it pragmatically to describe objects and the events that carry them through history.
[2] I’m going to start with a brief meta-level "header" about Keynotes, if you'll indulge me. There are three types of Keynotes, in my experience. The first two types tend to bore me to tears, so I hope not to repeat their mistakes!
[3] This is not a “CV” sort of keynote, where I just talk about things that I’ve done. If you want to read my CV, well ... you can read my CV ... or better, come and talk to me over the next couple of days!
[4] And it’s not a dry, scene-setting survey of the domain and its background. You already know all of that, and while you might have heard about a couple of things that you didn’t know about, you might also have been asleep by that point.
[5] Which leaves the third type: the call to action. I’m going to try and lay out where I think we, as a community, should be trying to go. How can we make a difference beyond these two days, when we get back to our regular work? And I’m not going to be coy and keep you in suspense as to what I want us to do ...
[6] It’s pretty simple. We should come together as a Community to agree on how best to create and publish linked open data about historical events and their participants.
[7] And then, of course, to go home and actually do something about it :)
[8] That's also pretty much the outline of the presentation. I’m going to talk about Community, which leads into what I think constitutes good linked data, and then about creating and publishing it.
[9] In my experience of the technical information standards world, starting with Z39.50 and SRU about 20 years ago, the most successful specifications about how to do something are those that come from Communities, not as mandates from self-appointed Committees.
Direction and vision are required... the focus of the community needs to be articulated clearly and convincingly. Communities need awareness and understanding of the problems that they are facing, and the motivation to work together to overcome them.
There needs to be leadership, but not at the expense of being open and welcoming. Not at the expense of engaging and understanding what is actually needed. Instead the focus gives a methodology for community based decision making, ensuring that the problems being solved are real for the members of the community.
Participation is the key requirement in solving practical challenges, not reputation. And Active participation, not just lurking and occasionally making a snide comment about how nothing is getting done. A community that isn't doing anything, isn't a community. It needs to not only think about the end result, but actively consider and adapt how it is getting there.
So Flexible, not Agile? I prefer the notion of the community being flexible but strong – when the willow tree sways with the wind it’s flexible, but stays true to its purpose. A cat agilely avoiding a dangerous situation doesn’t address the danger, it just avoids it. Agility lets you dance around the problem; Flexibility lets you overcome it.
Focused, Open, Active, Flexible ... sounds great, but how do we get there?
[10] Katherine Skinner of Educopia introduced me to this notion of the community engagement pyramid. There are a few people at the top of the pyramid, and increasingly many as you move down the tiers. In the IIIF community, for example, there are probably less than 5 active leaders, but a good 10-20 experts and advocates, maybe 50-100 contributors, 500 or so members that aren't constantly engaged but actively following, and then an impossible to determine number of people on the edges looking in. Or up.
The point is not to have a hierarchical and fixed structure, but to recognize that people look upwards and it is the sign of a healthy community when everyone on the tier above is reaching down to help those who want to take on a bigger role to do so.
In order to be successful in pulling off the amazing transformation of cultural heritage into linked open data, we need to have a solid understanding of how to work together strategically, while advancing our own organizations' immediate goals.
[11] Catherine Bracy, Director of Community for Code for America, gave an outstanding keynote at the Museum Computer Network in November, and I'm shamelessly echoing her points on how community leadership can most effectively work because they're bang on.
There are four easy steps. First, know your audience: who is the community, and who, as a community leader, do you need to be working with directly? Secondly, reach out to those people on their terms, not on yours. You need them to participate, and for that, they need to understand and agree with the goals and direction. Thirdly, have a continuing conversation with the community about everything! And fourthly, make opportunities for people to actively and meaningfully participate. Through participation you're building the ladders to bring them to the next level of the pyramid.
Let's get more concrete...
[12] And perhaps a little controversial ... Who IS the audience for Linked Open Data?
[13] Developers. Developers, Developers everywhere. And I mean that literally, not just your developers, but developers everywhere. Developers who cannot come to your office and ask why an E12 Production is an Activity, but an E6 Destruction is only an Event. Oh and if you can explain that to me, that'd be great too.
[14] Let's look at the LOD community pyramid, rather than the generic one. There are a few ontologists, architects and leaders who discuss, design and create ontologies. Then there are experts and content providers who understand and use those ontologies to make their data available ... to developers, who build applications for the users; and one of the most focused-on classes of user is researchers.
Individuals can fit into multiple tiers, and you should think of them as if they were the highest of those tiers. A researcher who can also write code or SPARQL queries is a developer. A content provider that can design a good ontology, rather than just follow a recipe, is an ontologist.
[15] Meet the developers on their terms. Don’t go to developers and talk about ontologies and triples and namespaces and reification and inference and SPARQL and quadstores and Turtle and ...
You lost them way back at ontologies. Instead talk about JSON, and APIs, and HTTP, and content, and applications. Because remember, developers are your gateway to the users of their applications. So as Lee says, listen to them and engage in a way that makes them feel needed and wanted, because they are!
[16] And don't just talk AT them, discuss WITH them. Have a conversation about what they need, what you can do to help them, and of course what they know their users need. Some middle aged white guy [cough] talking at an audience is a presentation, not a community.
[17] Finally, create opportunities for the community to participate meaningfully. Not just listening but actively engaging. Can the developers help fix your bugs? Can their users annotate your content with corrections and suggestions of relationships?
Listening to and acting on feedback is important, but think about ways for others to get involved. Can you provide APIs to let developers and users actually do something directly?
I love the expression of the woman in the middle. She's REALLY dubious about something! Maybe ...
[18] Ha! Said no Developer, Ever. Meeting on their terms, remember...
[19] That seems much more likely! Or at least, that's what I hear most often.
[20] So let's look at how this workflow plays out. It starts with the data being created – which is expensive! Money goes in, and something called "triples" comes out the other end of what appears to most management like a black box.
[21] Those triples go to the developer, who has to actually DO something with them to build a web application ...
[22] Which is then used by researchers (and others) to form hypotheses, do research and write papers. But each of them has different expectations ...
[23] For the creation and transformation of the dataset, there should be “No data left behind” – the ontology and model should be thorough and complete, a good knowledge representation.
The developer, however, wants the data to be usable and understandable, otherwise they can’t do their job of making it available.
And the researcher, not knowing any of the process behind the HTML application, needs the data to be accurate, otherwise when using it and comparing it with other data, their research will be equally inaccurate.
[24] What I've come to realize is that Linked Data can be Complete, Usable, Accurate ...
[25] ... Pick one.
[26] ... And pick Usable.
[27] It would be great to pick all three, but like “Good, Cheap, Fast, Pick two” it’s a matter of limited resources and priorities. For today, let’s take Accuracy off the table. Accuracy is a direct function of resources (spending time to find and fix errors), not the model or its representation. And we’re not going to solve the challenges of humanities funding today.
[28] By trying to make the data both Complete AND Usable, we're trying to optimize for two independent variables at the same time, with different purposes and most importantly different audiences. Are we prioritizing the ontologist and data manager, or the developer that has to work with the result? In my experience, we tend to meet our own requirements first and hope that the developer can make do with what they get.
[29] Optimizing for completeness fulfills the knowledge representation use case. But Linked Data also has a protocol aspect – it’s a fundamental part of LOD that the representation is available at its URI via HTTP.
[30] This means that we are implicitly also designing, and hopefully co-optimizing for, a Usable API. And this is where the Completeness and Usability axes become more complicated. The question is not only can I understand and use the model or data in that model, it’s how easy is it to use that data in the way it is made available – its serialization and the network transfers.
I think the chart of ontologies looks something like this ...
[31] At the beginning, any completeness adds to usability. Then there's a dramatic rise in usability, without adding so much completeness, when you hit the sweet spot of "just enough data". Then as you add more to the ontology, usability starts to drop off, slowly at first ... not so much as to be a significant problem ... but then you reach the tipping point where it becomes incomprehensible, and as completeness tends to 100%, usability tends to 0%.
Okay, I know you want examples, so some more potential controversy ...
[32] Bibframe 1.0 was terrible. It was complex without actually addressing the issues, worse than Simple Dublin Core, which at least has enough to get /something/ done with. Then you have frameworks further up the usability scale, like Web Annotations and EDM, which are also a little more complete, but by no means everything you want to say.
The IIIF Presentation API is, demonstrably, about as usable as a linked data API can get ... but we're constantly fighting to stay at high usability by resisting requests to add features.
Then comes the slippery slope that schema.org is further down ... still usable for now, but they're constantly adding to it ... and not in a sustainable or directed fashion ... until you hit rock bottom with CIDOC CRM and the meta-meta-meta statements available via CRMInf.
[33] The zone most ontologies should aim to be in, in my opinion, is the top right hand corner ... maximizing usability and completeness, maximizing the area between 0,0 and the x,y point reached. The community needs to take into account where on that slope will result in greatest adoption, as 100%/100% is a lovely dream, but impossible in reality.
[34] For our wedding anniversary, my wife and I drove up to see the giant redwoods in Northern California. There’s even a living tree you can drive your car through. However, there’s still the forest surrounding the trees that people want to look at, and without the forest, those trees would fall in the next storm. You can cut down a few trees to make sure that the rest survive and people can see them, but there’s no need to build paths to every tree. Like the forest, you can and should keep data around towards completeness and stability of the whole, without exposing it to developers. In the Getty vocabularies, we have a changelog of every edit for every resource … we publish that, and it just gets in the way of understanding for no value. We don’t have to throw it away completely to increase usability, we can just leave it out of the API.
So how do we understand what we should publish?
[35] How would we know where we are on the chart? The other stages have reasonably well understood evaluation processes -- the formal validity of the ontology, the extent to which it can encode all of the required information, unit and integration tests for code, user acceptance tests for the application, user personae to guide development ... but how should we evaluate the quality of the API we're providing to our data?
[36] Michael Barth lays out six fundamental features for API evaluation.
Abstraction Level -- is the abstraction of the data and functionality appropriate to the audience and use cases? An end user of the "car" API presses a button or turns a key. A "car" developer needs access to engine controls, all the modern safety features, and so forth.
Comprehensibility -- is the audience able to understand how to use it to accomplish their goals?
Consistency -- if you know the "rules" of the API, how well does it stick to them? Or, how many exceptions are there to a core set of design principles (like Destruction not being an Activity)?
Documentation -- how easy is it to find out the functionality of the API?
Domain Correspondence -- if you understand the core domain of the data and API, how closely does the understanding of the domain align with an understanding of the data?
And finally, what barriers to getting started are there?
[37] Sometimes I hear linked data experts say "You should just use Federated SPARQL Queries". But SPARQL, let alone federations of systems, performs very poorly on all of those metrics. To explain why "consistency" is "mediocre", that's because everyone has different underlying models and exceptions, and SPARQL is a complex but very thin layer over those models.
So when I hear people say "You should use SPARQL", my internal reaction is ...
[38] ... Now you have more problems than you can count.
[39] And even if you love SPARQL, the incontrovertible fact is that when you compare SPARQL developers to REST + JSON developers, the Venn diagram looks something like this. Sorry it's a little hard to see, let me zoom in a little for you ...
[40] There MIGHT be one SPARQL developer who doesn't know JSON, but I doubt it.
[41] Okay, so what do we need to have available in that data? (This is the brief background, scene setting bit)
The scope, in my view, is broadly covered by "historical activities", and the participants in those activities. We need a model for describing them, and shared resources such as people, places and objects, should have shared identities. There's no need to get into philosophical wars right now about the best ontology or the most appropriate sources of shared identity, so what CAN we practically and pragmatically discuss?
[42] Let's start with serialization. JSON-LD. It's JSON with explicit, managed structure. The keys are named in a way that’s easily understandable to humans and easy to use when programming. No hyphens, no numbers, no strange symbols everywhere, …
And let me explain the significance of the colors in the example. All the blue strings are URIs, including "ManMadeObject" and "Material". The only actual string literals are the two red labels. JSON-LD lets you manage the complexity of the graph in a way that ends up familiar to the audience, the developer, not daunting to them. Remember: Meet On Their Terms.
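For reference, a minimal sketch of what such a JSON-LD description might look like. The context URL, object URI, and property names here are hypothetical, in the spirit of the slide's example; the AAT identifier is the watercolor concept discussed later:

```json
{
  "@context": "https://example.org/context.json",
  "id": "https://example.org/object/1",
  "type": "ManMadeObject",
  "label": "Example Watercolor",
  "made_of": [
    {
      "id": "http://vocab.getty.edu/aat/300015045",
      "type": "Material",
      "label": "watercolor"
    }
  ]
}
```

Once the context is applied, "ManMadeObject", "Material" and both "id" values all expand to full URIs; the two "label" values are the only string literals, matching the blue/red distinction above.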
[43] Or ... Curly brackets are the new Angle brackets.
[44] If we’ve solved serialization ... well, have a direction at least ... what ARE the challenges we need to work on? Let’s count down my top 5.
Coming in at number 5 … Order!
[45] History is Ordered, and order is hard in RDF, right?
We’re fortunate that history is ordered globally by the steady march of time, not locally. For historical events, we can be universally correct, Dr Who time travel aside, in saying that an event in 1500 occurred after an event in 1490. No need to worry about explicitly ordering them, when applications can use the timestamps to do it themselves, as needed.
But for local ordering, use rdf:List. The serialization in JSON-LD is good, SPARQL 1.1 supports property chains, and SPARQL support shouldn’t be our main concern anyway.
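In JSON-LD, an rdf:List comes out cleanly as an @list container, so the developer just sees an ordered JSON array. A sketch, assuming a hypothetical "owners" property and example URIs:

```json
{
  "@context": {
    "owners": {
      "@id": "https://example.org/terms/owners",
      "@type": "@id",
      "@container": "@list"
    }
  },
  "@id": "https://example.org/object/1",
  "owners": [
    "https://example.org/person/first-owner",
    "https://example.org/person/second-owner"
  ]
}
```

Expanded to triples, that array becomes the usual rdf:first / rdf:rest linked-list structure, but the developer never has to see that machinery.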
[46] Number 4 is "boundary of representation" ... or which triples should be included when you dereference a particular URI.
[47] This is a critical point for the use of Linked Data as an API. We need to optimize the representations for use cases, based on the developer audience and what they need. There's no one rule that can generally determine the best practice here.
Note that the terms from AAT are used both by reference -- the object is classified as a painting, but without even a label -- and by value -- we're explicit that aat:300015045 is a Material, and it has a label of "watercolor". Why? Well, why indeed?! This is just an example, but one that we need to question with a critical eye, and discuss with developers as to whether it meets their needs.
JSON-LD does have an algorithm called Framing that makes this sort of effort much more consistent and efficient. Also, we might take a leaf or two out of Facebook's GraphQL book, where the request can govern some aspects of the response's representation.
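As a sketch of how a frame draws that boundary (the context and property names here are hypothetical), the following says "give me objects of this type, embed materials by value, but leave classifications by reference":

```json
{
  "@context": "https://example.org/context.json",
  "type": "ManMadeObject",
  "made_of": {
    "@embed": "@always"
  },
  "classified_as": {
    "@embed": "@never"
  }
}
```

The framing algorithm then reshapes whatever graph you hold into exactly that tree, which is what makes the representation consistent from one response to the next.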
[48] Number 3! The many levels of meta-data that we try to squash into a single response.
[49] RDF is good at making statements about reality, and bad at making statements about statements. At three levels it's terrible, and when you end up trying to make statements about what you believe that others used to believe, you’ve gone too far into the dream and need to get back to reality.
Inception made for a cool movie, but would be a terrible API. Use VOID for making broad statements about your dataset, and leave it at that. For example … Associate a license with the dataset, not with each resource … You’re not letting people use Rembrandt, the actual person, with a CC-BY license so don't claim that (like ULAN currently does). And don't reify everything in order to make mostly blanket statements about certainty or preference. No one is absolutely certain about anything, and everyone has a preference about which label or description to use ... but don't try and encode all of those subjective assertions against each triple!
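That dataset-level statement is tiny in practice. A sketch in JSON-LD, with a hypothetical dataset URI, attaching the license once via VoID and Dublin Core Terms:

```json
{
  "@context": {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/"
  },
  "@id": "https://example.org/dataset/people",
  "@type": "void:Dataset",
  "dcterms:license": { "@id": "https://creativecommons.org/licenses/by/4.0/" },
  "void:dataDump": { "@id": "https://example.org/dataset/people/dump.nt.gz" }
}
```

One statement about the dataset, rather than a reified statement on every resource or triple.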
[50] Of course ... naming things.
[51] Plato has Socrates discuss the correctness of names in Cratylus, which leads to the theory of forms, and to nominalism. As a species, we haven’t solved this question in the last 2500 years, so I would predict that we’re unlikely to solve it today either.
We have this problem in several different guises: the URIs of predicates, and their JSON-LD keys; and the URIs of instances, particularly ones that are common across datasets, and hopefully across organizations.
The only way to know the "best" or "most correct" name for something is through use and discussion.
[52] And if number 2 is Naming Things, then number 1 must surely be Cache Invalidation.
[53] And it really is. Efficient change management in distributed systems, without a layer of orchestration, is asking us to solve the problem of predicting the future.
The best we can do is add a lightweight notifications layer, using a publish/subscribe pattern with standards like LDN (Linked Data Notifications) and WebSub. Both are currently going through the standardization process within the Social Web Working Group of the W3C.
That gives us distributed hosting, but the potential for centralized discovery and use, where applications are informed about changes to remote systems that have been updated, so that they can update their caches of that information.
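As a sketch, an LDN notification is just a small JSON-LD document POSTed to a subscriber's inbox. LDN itself doesn't mandate a vocabulary for the payload, but an ActivityStreams "Update" is a natural fit (the object URI and timestamp here are hypothetical):

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Update",
  "object": "https://example.org/object/1",
  "published": "2016-12-05T12:00:00Z"
}
```

The consumer receives that, invalidates its cached copy of the object, and re-fetches it at its own convenience.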
[54] And finally Number Zero, Off-By-One errors.
[55] Home stretch folks, thank you for your attention so far!
Practical Linked Open Data... P L O D ... Plod. I've gotta say, Leif, it's not an awe-inspiring acronym, sorry. Mr Plod the Police Officer is just not the most exciting figure. Police make you think of being told to stop, and we want to move forwards and learn from our mistakes. To get something out there, and iterate.
[56] Remember ... We Want U. U for Usable.
[57] So my call to action, fans, is to Get LOUD. Linked Open Usable Data.
[58] With the Community ... Community Linked Open Usable Data ... or the CLOUD. (And it wouldn't be a linked data keynote without the LOD CLOUD diagram, would it!)
[59] Community (or connected) Linked Open Usable Data.
The community includes everyone in that pyramid, and together we can share the burdens across the levels. Shared ontologies, shared identities, shared code, shared use cases ... but also think about this: by linking to other people's data, you're reducing your own completeness burden. And theirs at the same time. Not everyone needs a complete description of Rembrandt or Plato.
Enabling users and developers to provide feedback on your data reduces your burden of Accuracy. They have the possibility to work with you to correct it.
And working directly with developers, regardless of whether they're in your organization or not, validates your Usability. Barth has several really good options for how to put that evaluation into practice, I strongly recommend his paper.
And remember FOAF ... who got that first time round? ... Focused, Open, Active, Flexible.
[60] The audience for your actual Linked Data is the developers within your community. We need to meet on their terms, and allow them (and their users) to participate in the creation and management process for the data. We need to focus on usability of the data, not necessarily the completeness nor the accuracy.
A good way to do that is through the use of JSON-LD, with frames governing the graph boundaries and validated through use cases and discussion about the names used. Let them know, through notifications, when the data is updated.
Let developers help with usability, let the community help with completeness, and let the users help with accuracy.
You’ll have to deal with the off by one errors yourself, unfortunately.
[61] Thank you!
[62] I don't want to "take questions". I want us, as a community, to discuss :) So ... Discussion?
Discussion it is then:
But at least one of the most used Java frameworks (rdf4j) doesn't support optional paths efficiently: eclipse-rdf4j/rdf4j#695. The root of the problem is that ?x prop? ?y must return all nodes in the repo. Don't know what's the case with other SPARQL processors (eg Jena ARQ).
DBpedia and Wikidata contain no rdf:List, so I can't check their speed re lists: select * {?x a rdf:List}. Can you try that at some other endpoints (see the list at SparqlES)?
I think GVP does the right thing: uses a custom shape/CONSTRUCT to return all data about each independent entity (concept, contributor, source)
Framing can help you reshape it, or subset it (eg Simple vs Complete profile).
GraphQL is great, agreed.
LOUD & CLOUD: I like it!
Then again, a couple years ago one of the CRM aficionados was shouting the same at the CIDOC congress.
The problem is not to shout it, but to do it in a collaborative way (the way you described).
Cheers!