Created
February 17, 2011 16:32
Make use of VIAF authority records
#!/usr/bin/perl

=head1 NAME

viaflookup.pl - How to make use of VIAF authority records

=head1 VERSION

Version 0.2 - 2011-02-18

=cut

use strict;
use LWP::Simple;
use Data::Dumper;
use CGI qw(escape param header);
use JSON;
use Carp;

=head1 DESCRIPTION

The L<Virtual International Authority File|http://www.viaf.org> (VIAF)
combines authority files from more than a dozen libraries and countries.
This script implements and describes the use of the VIAF API. See also the
L<developer documentation|http://www.oclc.org/developer/services/viaf>
provided by OCLC. To make the best use of VIAF, you should be familiar with
its basic concepts: authority schemes, authority agencies, authority
records, and identifiers. Some L<basic notes about RDF|/RDF> are given below.

=head2 Schemes

In VIAF B<authority files> are called "authority schemes". An authority
file or scheme is a collection of controlled L<authority records|/Records>
that uniquely identify a person, an institution, or another concept. Some
also call this a "Knowledge Organization System". Authority files have been
used in libraries for more than a century. However, these schemes have
rarely been connected to one another. This is where VIAF comes into play.

VIAF connects records from several authority schemes. In VIAF each scheme
is identified by a scheme identifier, for instance C<DNB> for the German
"Gemeinsame Normdatei" (GND). Most identifiers consist of uppercase letters,
but the same identifier also occurs in lowercase in some places, and some
identifiers mix uppercase and lowercase, so there seems to be no single
canonical form. VIAF defines a URI for each scheme, based on its identifier,
for instance L<http://viaf.org/authorityScheme/DNB>. The scheme identifier
is also used to identify L<agencies|/Agencies>. This mixing of concepts can
be seen as a bug or a feature, but you need to take care when using VIAF.

For each scheme there is a little icon that can be accessed by its identifier,
for instance L<http://viaf.org/viaf/images/flags/NLIheb.png>.
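
The two URL patterns above can be sketched as small helpers. This is only an
illustration: the function names C<scheme_uri>, C<scheme_icon>, and
C<canonical_scheme> are hypothetical, and the case-insensitive lookup is an
assumption made because identifiers are not spelled consistently.

```perl
use strict;
use warnings;

# Hypothetical helpers: build the scheme URI and the icon URL for a VIAF
# scheme identifier, following the two URL patterns quoted above.
sub scheme_uri {
    my ($scheme) = @_;
    return "http://viaf.org/authorityScheme/$scheme";
}

sub scheme_icon {
    my ($scheme) = @_;
    return "http://viaf.org/viaf/images/flags/$scheme.png";
}

# Look up an identifier in a list of known schemes, ignoring case, and
# return the spelling used in that list (or undef if it is unknown).
sub canonical_scheme {
    my ($scheme, @known) = @_;
    my ($match) = grep { lc($_) eq lc($scheme) } @known;
    return $match;
}

my @known = qw(DNB LC NLIheb RERO);
print scheme_uri('DNB'), "\n";
print scheme_icon( canonical_scheme('nliheb', @known) ), "\n";
```

Resolving the spelling first matters because the icon URL appears to be
case-sensitive in the examples (C<NLIheb.png>), while other places use
lowercase identifiers.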

=head2 Agencies

An agency is an organization that publishes at least one authority scheme.
Most agencies in VIAF provide only one scheme, so they are simply identified
by their scheme's identifier, for instance C<DNB> for the German National
Library. Institutions that provide more than one scheme have a different
identifier, for instance C<NLI> for the National Library of Israel. It is
assumed that VIAF defines a URI for each agency, for instance
L<http://viaf.org/authorityAgency/DNB>, but these URIs and the connection
between agencies and their schemes are not published in RDF yet.

A list of participating institutions in VIAF can be found at the VIAF
homepage L<http://viaf.org/> in HTML. For each agency there is an icon
as well, for instance L<http://viaf.org/viaf/images/flags/RERO.png>.

=head2 Records

Authority records are identified by their scheme and a local identifier
within this scheme. VIAF combines both parts to form one identifier, but
there are several forms:

=over

=item Simple string form

Scheme identifier and local identifier, combined with a vertical bar.
For instance C<LC|n 50034593> identifies a Library of Congress name
authority record.

=item Processed string form

Scheme identifier and normalized local identifier, combined with a
vertical bar. For instance C<LC|n 50034593>. Normalization rules
depend on the particular authority scheme.

=item URI forms

The simple string form is used to define several URIs. However, these
URIs are not suitable as permanent linked data URIs because of problems
with the encoding of characters that don't belong in URIs and because of
missing content negotiation. For each record there are at least the
following URIs (given as an example for the record C<LC|n 50034593>).

=over

=item L<http://viaf.org/processed/LC%7Cn%2050034593>

A representation of the source record that VIAF used for mapping
(in MARCXML format).

=item L<http://viaf.org/processed/LC%7Cn%2050034593#skos:Concept>

The authority record.

=item L<http://viaf.org/viaf/sourceID/LC%7Cn%2050034593>

An HTTP 302 redirect to the mapped VIAF record.

=back

Most authority files should define their own clean, strict, and
resolvable URIs for authority records. If there is such a URI,
you should be able to construct it from the local identifier.
Depending on the identifier structure, the institution may need to
define some normalization, for instance as described here for LCCN:
L<http://www.loc.gov/marc/lccn-namespace.html#normalization>

For instance L<http://d-nb.info/gnd/118540475> is a much better
URI than L<http://viaf.org/processed/DNB%7C118540475#skos:Concept>.

=back

VIAF records are just a special kind of authority record that
contains mappings to other authority records. You can get VIAF records
in different formats (VIAF-XML, MARCXML, UnimarcXML, RDF, JSON).
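
As a rough sketch, the three URI forms listed above can be derived from the
simple string form by percent-encoding the vertical bar and the blank. The
helper names C<percent_encode> and C<record_urls> are hypothetical, and the
encoder assumes ASCII input, which holds for the identifiers shown here.

```perl
use strict;
use warnings;

# Percent-encode every character outside the RFC 3986 "unreserved" set.
# The input is assumed to be ASCII, as in the identifiers shown above.
sub percent_encode {
    my ($string) = @_;
    $string =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $string;
}

# Hypothetical helper: return the VIAF URLs for one source record, given
# its scheme identifier and its local identifier.
sub record_urls {
    my ($scheme, $local) = @_;
    my $id = percent_encode("$scheme|$local");
    return {
        processed => "http://viaf.org/processed/$id",
        concept   => "http://viaf.org/processed/$id#skos:Concept",
        redirect  => "http://viaf.org/viaf/sourceID/$id",
    };
}

my $urls = record_urls('LC', 'n 50034593');
print "$urls->{$_}\n" for qw(processed concept redirect);
```

For C<LC|n 50034593> this yields the same three URLs as the list above.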

=cut

# agencies and schemes
my $schemes = {
    BAV => {
        name => 'Vatican Library',
    },
    BNE => {
        name => 'Biblioteca Nacional de España',
    },
    BNF => {
        name => 'Bibliothèque nationale de France',
        records => 'http://catalogue.bnf.fr/ark:/12148/cb$1' #t
    },
    DNB => {
        name => 'Gemeinsame Normdatei',
        records => 'http://d-nb.info/gnd/$1'
    },
    EGAXA => {
        name => 'Bibliotheca Alexandrina',
    },
    ICCU => {
        name => 'Italian National Catalog',
    },
    JPG => {
        name => 'Getty ULAN',
    },
    JPGRI => {
        name => 'Getty Research Institute',
    },
    LAC => {
        name => 'Library and Archives Canada',
    },
    LC => {
        name => 'Library of Congress Authorities',
        short => 'LOC',
        # see http://www.loc.gov/marc/lccn-namespace.html#normalization
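        filter => sub {
            # normalize the identifier passed as argument (or $_ if none is given)
            my $id = @_ ? shift : $_;
            $id =~ s/ //g;        # remove all blanks
            $id =~ s/\/.*$//;     # remove a forward slash and everything after it
            if ( $id =~ /^([^-]+)-(.*)$/ ) {
                # always drop the hyphen; left-pad the serial part to six digits
                my ($prefix, $serial) = ($1, $2);
                $serial = ('0' x (6 - length $serial)) . $serial
                    if length($serial) < 6;
                return $prefix . $serial;
            }
            return $id;
        },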
        pattern => qr/^([a-z]*\d+)$/,
        records => 'info:lccn/$1'
    },
    NKC => {
        name => 'National Library of the Czech Republic',
    },
    NLA => {
        name => 'National Library of Australia',
    },
    NLI => {
        name => 'National Library of Israel',
    },
    NLIara => {
        name => 'National Library of Israel',
    },
    NLIcyr => {
        name => 'National Library of Israel',
    },
    NLIheb => {
        name => 'National Library of Israel',
    },
    NLIlat => {
        name => 'National Library of Israel',
    },
    NSZL => {
        name => 'National Széchényi Library (Hungary)'
    },
    NUKAT => {
        name => 'NUKAT, Poland'
    },
    PTBNP => {
        name => 'Biblioteca Nacional de Portugal',
    },
    RERO => {
        name => 'RERO (Switzerland)'
    },
    SELIBR => {
        name => 'National Library of Sweden',
        records => 'http://libris.kb.se/auth/$1'
    },
    SWNL => {
        name => 'Swiss National Library',
    },
    VIAF => {
        name => 'Virtual International Authority File',
        uri => 'http://viaf.org/viaf/$1/',
    },
};

=head2 Making use of VIAF

VIAF provides a large amount of information. Some typical queries are:

=over

=item Find authority records for a person

Given a name, you want to know whether authority records exist and which
ones, so you can create links to an authority. Linking to authorities
is best practice in cataloging, so this is an important query.

In VIAF you can search by name either via SRU or via a simple REST
API. If you only want to find authority records, the latter is the
better choice. Here is an example query:

L<http://viaf.org/viaf/AutoSuggest?query=Emma%20Goldman>

The result is a JSON document that echoes the normalized C<query> and
gives a (possibly empty) ordered list of VIAF records as C<result>. Each
VIAF record contains the full name of a person as C<term> and local
authority record identifiers, keyed by the scheme identifier in lowercase.

=back
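
A minimal sketch of how such a response could be processed, using the core
module C<JSON::PP>. The sample document is hand-written to match the shape
described above and is not a real response; the function name
C<suggestions> is hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hand-written sample in the shape described above; NOT a real response.
my $sample = '{"query":"emma goldman","result":['
           . '{"term":"Goldman, Emma","lc":"n50034593"}]}';

# Turn a decoded AutoSuggest document into a list of
# [ term, { SCHEME => local identifier, ... } ] pairs.
sub suggestions {
    my ($doc) = @_;
    my @found;
    for my $record ( @{ $doc->{result} || [] } ) {
        # every key except "term" is treated as a lowercase scheme identifier
        my %ids = map  { ( uc($_), $record->{$_} ) }
                  grep { $_ ne 'term' } keys %$record;
        push @found, [ $record->{term}, \%ids ];
    }
    return @found;
}

for my $hit ( suggestions( decode_json($sample) ) ) {
    my ($term, $ids) = @$hit;
    print "$term: ", join( ', ', map { "$_|$ids->{$_}" } sort keys %$ids ), "\n";
}
```

The C<|| []> fallback covers the case where C<result> is missing or null, so
an unsuccessful search simply yields an empty list.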

=cut

use LWP::UserAgent;
use HTTP::Request::Common;

my $ua = LWP::UserAgent->new;

# my $format = param('format'); # TODO: seealso, rdf, etc.
my $search = param('search') || "";
$search =~ s/[\r\n]//g;    # strip all line breaks from the query

my $suggest = 0; #param('suggest'); # TODO
my $id = 0; #param('id');

print header('text/plain; charset=UTF-8');
binmode *STDOUT, ":utf8";

if ($search) {
    my @clusters = searchName( $search );
    # print "$search\n";
    foreach (@clusters) {
        print $_->condensed . "\n";
    }
} elsif ($suggest) {
    # search for name
    my $url = 'http://viaf.org/viaf/AutoSuggest?query=' . escape($suggest);
    # print "URL:$url\n";
    my $body = get($url);    # undef if the request failed
    my $json = defined $body ? decode_json($body) : undef;
    if ( $json && $json->{result} ) {
        #print Dumper($json);
        foreach (@{$json->{result}}) {
            handle_record ($_);
        }
    }
} elsif ($id) {
    if ($id =~ /^([A-Za-z]+)$/) {
        # TODO: get information about a scheme or agency
    } elsif ($id =~ /^(VIAF\|)?(\d+)$/) {
        my $url = "http://viaf.org/viaf/$2";    # numeric part only, without "VIAF|"
        # TODO
    } elsif ($id =~ /^([A-Za-z]+)[|:](.+)$/ and $schemes->{$1}) {
        my $url = 'http://viaf.org/viaf/sourceID/' . escape("$1|$2");
        # GET from HTTP::Request::Common takes headers as a flat key/value list
        my $response = $ua->request( GET $url, 'Accept' => 'application/rdf+xml' );
        # http://viaf.org/viaf/sourceID/LC%7Cn%2050034593
    } else {
        #print STDERR "Unknown id format\n";
    }
}

sub handle_record { # FIXME
    my $r = shift;
    my @keys;
    foreach my $prefix (keys %$r) {
        next if $prefix eq 'term';
        my $local = $r->{$prefix};
        $prefix = uc($prefix);
        print "$prefix|$local";
        if ( $schemes->{$prefix} && $schemes->{$prefix}->{records} ) {
            my $uri = $schemes->{$prefix}->{records};
            my $pattern = $schemes->{$prefix}->{pattern} || qr/^(\d+)$/;
            if ($local =~ $pattern) {
                my ($a,$b) = ($1,$2); # TODO: $3, $4, ...
                $uri =~ s/\$1/$a/;
                $uri =~ s/\$2/$b/;
                print " = $uri";
            }
        }
        print "\n";
    }
    print "\n";
}

=head2 searchName

Search for a name in VIAF. Internally this method performs an SRU query.
Returns a (possibly empty) list of up to 10 L<VIAF::Cluster> records.
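
The query URL behind this method can be sketched without any network access.
The helper name C<sru_url> is hypothetical, and the encoder assumes an ASCII
name; a name with non-ASCII characters would have to be UTF-8 encoded before
percent-encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: build the SRU searchRetrieve URL for a personal name.
sub sru_url {
    my ($name, $max) = @_;
    $max ||= 10;
    $name =~ s/["'\\]//g;    # drop quotes and backslashes that would break the CQL string
    my $query = qq{local.personalNames all "$name"};
    # percent-encode everything outside the RFC 3986 "unreserved" set
    $query =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return 'http://viaf.org/viaf/search?version=1.1&operation=searchRetrieve'
         . "&maximumRecords=$max&httpAccept=text/xml&query=$query";
}

print sru_url('Emma Goldman'), "\n";
```

Only the CQL query value is encoded here; the fixed parameters are kept
exactly as they appear in the script.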

=cut

sub searchName {
    my $name = shift;
    $name =~ s/['"\\]//g;

    # retrieve response in VIAF-XML. Alternatively we could use RDF/XML
    my $url = "http://viaf.org/viaf/search?version=1.1&operation=searchRetrieve"
            . "&maximumRecords=10&httpAccept=text/xml"
            . "&query=" . escape("local.personalNames all \"$name\"");

    # "use" runs at compile time and cannot be trapped by eval; "require" can
    eval { require XML::XPath; 1 }
        or croak "Missing XML::XPath module to parse SRU response";

    my $xml = get($url);
    return unless defined $xml;    # request failed: return an empty list
    #my $fh; open ($fh, "<", "viaf.xml");
    #my $xml = join("\n",<$fh>);

    my $xpath = XML::XPath->new( xml => $xml );
    $xpath->set_namespace('v','http://viaf.org/viaf/terms#');

    my @clusters;
    foreach my $cluster ( $xpath->findnodes('//v:VIAFCluster[v:nameType="Personal"]') ) {
        my $type = $xpath->findvalue('v:nameType', $cluster);
        my $id = $xpath->findvalue('.//v:viafID', $cluster);
        my $term = $xpath->findvalue('(v:mainHeadings//v:text)[1]', $cluster);
        my $c = VIAF::Cluster->new( viaf => $id, term => $term );
        # TODO: extract link to WorldCat Identities, Wikipedia, and DBPedia...
        foreach my $source ( $xpath->findnodes( './/v:source', $cluster ) ) {
            my $id = $source->string_value();
            if ( $id =~ /^([A-Za-z]+)\|(.+)$/ ) {
                $c->add($1,$2);
            }
        }
        push @clusters, $c;
    }

    return @clusters;
}

# Example:
# http://viaf.org/viaf/39377930/
# http://www.worldcat.org/wcidentities/lccn-n50-34593
# http://wikipedia.org/wiki/Emma_Goldman
# http://dbpedia.org/resource/Emma_Goldman

#package VIAF;

package VIAF::Cluster;

use Scalar::Util qw(refaddr);

sub new {
    my $class = shift;
    my $self = bless { @_ }, $class;
    return $self;
}

sub add {
    my ($self,$prefix,$id) = @_;
    unless ( $schemes->{$prefix} ) {
        foreach my $key ( keys %$schemes ) {
            next unless lc($key) eq lc($prefix);
            $prefix = $key;
            last;
        }
    }
    my $scheme = $schemes->{$prefix} || return;
    # TODO: normalize id
    $self->{$prefix} = $id;
}

sub uri {
    my $self = shift;
    return "http://viaf.org/viaf/" . $self->{viaf} if $self->{viaf};
}

sub bnode {
    my $self = shift;
    return refaddr($self);
}

sub uri_nt {
    my $self = shift;
    return $self->{viaf} ? "<".$self->uri.">" : "_:b".$self->bnode;
}

sub condensed {
    my $self = shift;
    my $string = join( ";",
        map { uc($_)."|".$self->{$_} }
        grep {$_ ne 'term'} keys %$self
    );
    $string .= " = " . $self->{term} if $self->{term};
    return $string;
}

1;

=head2 NOTES

=head3 RDF

VIAF and authority files do not depend on RDF, but RDF is a good technology
for making use of authority data. The basic ontology for authority schemes is
the L<Simple Knowledge Organization System|http://www.w3.org/2004/02/skos/>
(SKOS). The core concepts of VIAF are mapped to the following parts of SKOS:

=over

=item Schemes

L<skos:ConceptScheme|http://www.w3.org/2004/02/skos/core#ConceptScheme>.

=item Agencies

...

=item Records

L<skos:Concept|http://www.w3.org/2004/02/skos/core#Concept>.

=back

The current RDF representation of VIAF data uses the SKOS namespace
L<http://www.w3.org/2004/02/skos/core#>. This is also the namespace of the
final SKOS Recommendation. The namespace L<http://www.w3.org/2008/05/skos#>
appeared only in an intermediate working draft and was later withdrawn, so
SKOS classes and properties in VIAF data do not need to be rewritten.

In addition to SKOS, VIAF defines its own ontology that is located at
L<http://viaf.org/>. To access RDF data from this and other URIs, you need
to send an HTTP request with a special C<Accept> header to tell the server
that you want RDF data instead of an HTML page. Most RDF tools do this for
you. I recommend the command-line tool C<rapper>.

=head1 AUTHOR

Jakob Voss C<< <[email protected]> >>

=head1 LICENSE

Copyright (C) 2011 by Verbundzentrale Goettingen (VZG) and Jakob Voss

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, either Perl version 5.8.8 or, at
your option, any later version of Perl 5 you may have available.
The VIAF developer documentation has been moved, it seems - now available at
http://www.oclc.org/developer/develop/web-services/virtual-international-authority-file-viaf.en.html
The OCLC has also published a nice developer handbook, available:
http://www.oclc.org/developer/develop/web-services.en.html
The developer handbook does not mention VIAF, but it may still be useful when working with the OCLC web APIs.
VIAF in the OCLC API Explorer:
https://platform.worldcat.org/api-explorer/VIAF
It seems that the "Jane Ausitin" resource ID from the example in the API explorer has been updated,
Regarding the SRU syntax used in the API SRUSearch function:
http://www.loc.gov/standards/sru/
Raw VIAF data, in RDF, MARC-21, and plain text formats:
http://viaf.org/viaf/data/