
From a reddit-comment:

Stupid question on "semanticweb": how do I actually get data? It says the Library of Congress is on 'linked data' now. If I want to get a book name by ISBN (using LOC's 'linked data'), how would I do that?

Is there a website for the standardized format of linked data? Are there APIs available?

Also, how do I cross-correlate linked data? Say Amazon also had a linked dataset (or some other "publisher"). How do I correlate ISBN numbers between Amazon, the LOC, the patent office, etc., to verify the integrity of such data? Lots of stuff on Google is inaccurate, but that is "ok" because people are verifying it. But with an application, you need a way to ensure the data is correct and is what you are actually looking for.

How do I get the data?

Option 1: data dump

The different data providers usually provide a data dump (e.g. DBpedia, the LOC subject headings). This means loading it into a triplestore and manipulating/querying it yourself (see below).
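
As a taste of what that looks like, a minimal sketch with rdflib (the filename is a placeholder for whatever dump you downloaded):

# Load a downloaded N-Triples dump into an in-memory graph.
# "dump.nt" is a placeholder filename.
from rdflib import Graph

g = Graph()
g.parse("dump.nt", format="nt")
print(len(g))  # number of triples in the dump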

Option 2: explore/browse

Some publishers provide endpoint-specific browsers. For instance, with DBpedia's browser or Dataincubator's Linked Periodicals browser, you have a spartan but functional interface to search and browse around.

There are also generic browsers like Tabulator, Disco, Marbles or Zitgist. These can aggregate links/data across the linked data cloud while browsing.

Tabulator-example.

Option 3: query

Free-text search is possible with services like Sindice, which index the linked data cloud; Freebase has an RDF interface, and the BBC publishes data as well.

Most data publishers also provide a SPARQL endpoint.
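
Under the hood, such an endpoint is just an HTTP service: the query travels as a URL parameter and the result set comes back over plain HTTP. A minimal Python sketch (the endpoint and resource URIs are illustrative):

# A SPARQL endpoint is queried with a plain HTTP GET; the query string
# goes in the "query" URL parameter. URIs below are illustrative.
import urllib.parse
import urllib.request

endpoint = "http://dbpedia.org/sparql"
query = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Amsterdam> ?p ?o } LIMIT 10"

url = endpoint + "?" + urllib.parse.urlencode({"query": query})
print(urllib.request.urlopen(url).read())  # result set, XML by default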

Example SPARQL query (online)

Example SPARQL query (command line)

# roqet is part of librdf
# install with apt-get install redland-utils

# completed illustrative example; the query and DBpedia URIs are placeholders
$ roqet -e 'SELECT ?p ?o FROM <http://dbpedia.org/data/Amsterdam.rdf> WHERE { <http://dbpedia.org/resource/Amsterdam> ?p ?o }'

Example SPARcool query

Why is this not more intuitive?

Publishing and interlinking (large) datasets is an ongoing effort (see cloud 1, cloud 2). It is not yet "polished" enough for end-users to get what they want with a simple search. A current initiative is voiD (Vocabulary of Interlinked Datasets), which describes the contents of datasets, the URIs to use, etc. My guess is that interfaces will start to use this information to provide a more streamlined process.

What format can I expect?

Linked data shares a common data model (RDF), but is serialized in different syntaxes. The most common are RDF/XML, Turtle and N-Triples. RDF/XML is XML-based, and is the de facto standard for interchange. Turtle is more human-readable and suited for manual authoring. N-Triples is the "raw" dump, where for instance namespaces are not abbreviated: useful for debugging strange behavior.

These three are pretty universally supported by services and tools, and you can round-trip between them, either using local tools or webservices (morph, babel, triplr, ...).
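
To make the difference concrete, here is one and the same triple in Turtle and in N-Triples (an invented example):

# Turtle: namespaces are abbreviated with @prefix, so it reads well
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<urn:isbn:0451450523> dc:title "The Name of the Rose" .

# N-Triples: one complete triple per line, nothing abbreviated
<urn:isbn:0451450523> <http://purl.org/dc/elements/1.1/title> "The Name of the Rose" .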

A second set of formats comes into play when you use SPARQL. A SPARQL query returns a fixed result set in either XML or JSON; the two result-format examples below show both, and one of these will likely be the format that your application consumes.

Examples

Example with triplr

Surf to http://

Command line example with rapper

# rapper is part of librdf
# install with apt-get install redland-utils

$ rapper -i turtle -o rdfxml some_turtle_file.ttl > some_rdfxml_file.xml

API-example with rdflib

# Python package rdflib
# install with easy_install rdflib

from rdflib import Graph
g = Graph()
g.parse("some_rdfxml_file.xml")                   # file in RDF/XML format
g.parse("some_turtle_file.ttl", format="turtle")  # file in Turtle format
len(g)                                            # number of triples loaded
g.serialize(format="nt")                          # serialize in N-Triples format
g.serialize(format="turtle")                      # serialize in Turtle format

SPARQL query results:

SPARQL-XML results

SPARQL-JSON results
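
Since the JSON variant is the easiest to consume from code, here is a sketch of what parsing it looks like (the endpoint and query are again illustrative):

# Ask the endpoint for JSON results via the Accept header, then walk the
# standard results/bindings structure. URIs are illustrative.
import json
import urllib.parse
import urllib.request

endpoint = "http://dbpedia.org/sparql"
query = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Amsterdam> ?p ?o } LIMIT 5"
url = endpoint + "?" + urllib.parse.urlencode({"query": query})

req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
results = json.load(urllib.request.urlopen(req))

for row in results["results"]["bindings"]:  # one entry per result row
    print(row["p"]["value"], row["o"]["value"])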

How do I cross-link the data?
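
In RDF, the cross-correlation from the original question is done with explicit, typed links between datasets. The workhorse is owl:sameAs, which asserts that two URIs identify the same resource; browsers and stores can then merge what both datasets say about it. A minimal sketch in Turtle, with placeholder URIs:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Placeholder URIs: the same ISBN as identified in two different datasets.
<http://dataset-a.example/book/0451450523>
    owl:sameAs <http://dataset-b.example/isbn/0451450523> .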

How can I verify the data?

This depends on what you mean by "verify the integrity of the data".

Verify the data is complete/untampered

Compare MD5 checksums of the data, use XML-Sig, etc. This is a planned but currently underdeveloped area. One difficulty is that, for instance, the RDF/XML serialization is not fixed/ordered, so two dumps of the same graph need not be byte-identical.
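
One workaround (a sketch, not a standard): N-Triples is line-based, so hashing the sorted lines gives a reasonably stable fingerprint, at least for graphs without blank nodes:

# Fingerprint a graph by sorting its N-Triples lines before hashing.
# Not a real canonicalization (blank node labels may still differ).
import hashlib
from rdflib import Graph

g = Graph()
g.parse("some_rdfxml_file.xml")  # placeholder filename

nt = g.serialize(format="nt")
if isinstance(nt, bytes):        # older rdflib versions return bytes
    nt = nt.decode("utf-8")
print(hashlib.md5("\n".join(sorted(nt.splitlines())).encode("utf-8")).hexdigest())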

Verify that data is in line with previously received data

This is equivalent to, e.g., receiving and handling additional data through REST or SOAP.

Verify that data is not contradictory

How to store the data?

See triplestores. Most are a combination of store, API, query-interface, etc. The Java-based ones (Jena, Sesame) are imho the most mature; ARC (PHP) is the most user-friendly to start with; rdflib (Python) is pretty decent.
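
For instance, recent rdflib versions bundle a store and a SPARQL engine, so you can query a local graph directly (a sketch; the filename and query are placeholders):

# Query an in-memory rdflib store with SPARQL.
from rdflib import Graph

g = Graph()
g.parse("some_rdfxml_file.xml")  # placeholder filename

rows = g.query("""
    SELECT ?s ?o
    WHERE { ?s <http://purl.org/dc/elements/1.1/title> ?o }
    LIMIT 5
""")
for s, o in rows:
    print(s, o)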

How to handle the data?

Data-wrangling

Display

Exhibit is a very nice "ajax" framework for displaying and faceted browsing of structured data.
