Skip to content

Instantly share code, notes, and snippets.

@eshellman
Last active August 29, 2015 14:17
Show Gist options
  • Save eshellman/e1c13ecb97b0da1e4327 to your computer and use it in GitHub Desktop.
Save eshellman/e1c13ecb97b0da1e4327 to your computer and use it in GitHub Desktop.

GITenberg metadata

Part 4. Metadata that’s needed, but missing. Also, covers.

In Part 3, I looked at the metadata that’s available via api from other metadata sources. But there’s a bunch of metadata internal to GITenberg (and to a lesser extent, Project Gutenberg) that should probably be included in a GITenberg metadata file.

Housekeeping data

  • the Repo URL. For Space Viking, that would be https://github.com/GITenberg/Space-Viking_20728 Unfortunately, because of unicode and OS level issues, it’s not as simple as you might think to derive the url from other metadata. And maybe for portability, we should just keep "Space-Viking_20728"

  • Download URLs. Well maybe not. It probably makes more sense to have a resolver service separate from the repo.

  • version info. Again, maybe not. It probably makes sense to use Github’s versioning to keep track of this. On the other hand, downstream sites will need to know this stuff. But a MARC record builder could grab version info directly from Github rather than a version-controlled metadata file.

  • process/toolchain info. This requirement is best understood with an example. We have a tool that changes ascii quotes into typographic quotes. It might not be obvious from inspection of a file that this process has been completed. One way to record this would be to add something like "quotification_completed: March 23, 2015" to the metadata file. I’m not sure if this would be the best way to do this.

Cover data

This could probably be a post on its own, but covers, and to a lesser extent, illustrations, present a set of new issues when combined with a version control system. The traditional view in the book world is that a book may have many editions (or manifestations), but each edition has a single cover associated with it. For public domain books, it’s common for tens, hundreds, or even thousands of editions to exist, each with its own cover. The text inside might be the text from Gutenberg, perhaps with some editorial enhancement. If GITenberg succeeds, this will no doubt continue, but with improved quality all around.

What’s changed with public domain digital books, is that there’s no need for a "manifestation" to fix a single cover. For some purposes, a scan of the original printed cover might be the best choice for a cover image. For other purposes, a branded, generated cover might do a better job of communicating to a potential reader what’s "behind" the cover. Projects such as Recovering the Classics are creating new composite works by letting artists and designers breathe new life into covers for old books. And it’s simple to swap in an alternate cover on-demand during a digital download.

The consensus in the GITenberg core team has been that we need to make room for different types of covers. Each cover included in a repo will need its own metadata- license, cover type, attribution, and perhaps other things. And there needs to be available for every book an attractive public domain or CC0 cover that works well in a digital display.

Source data

Most PG texts I’ve seen have some indicators of the production process. For example, in the text of PG’s Space Viking there is:

Produced by Greg Weeks, William Woods and the Online
Distributed Proofreading Team at http://www.pgdp.net

[Transcriber's note:
This etext was produced from Analog Science Fact--Science Fiction
November 1962, December 1962, January 1963, February 1963.
Extensive research did not uncover any evidence that the copyright
on this publication was renewed.]

These notes and credits should be moved to the metadata, and the ebook builder should present these notes appropriately. And bibliographic data for the original edition should get recorded, if available. I can’t imagine that there isn’t a worldcat record for 99% of the source editions in PG; these ought to be identified and added somehow.

  • Much better provenance, including links to DP projects, scanned source files, Internet Archive mirrors, etc would be useful metadata to add (from Tom Morris)

Quality data

A library, bookseller or other sort of distributor should be able to select based on the quality of the ebooks available. A set of objective quality criteria would be useful. Here are some examples:

  • passes epubcheck

  • includes a clickable toc/index

  • includes semantic markup

Relations within PG

(from Tom Morris)

  • PG has multiple editions of many works and often the later ones are of higher quality than the older "editionless" editions, yet the earlier ones get downloaded way more. Enhancing the bibliographic information to help with this issue would be useful to readers. For example, this early editionless Pride & Prejudice http://www.gutenberg.org/ebooks/1342#bibrec is downloaded over 30 times more often than this later high quality transcription of the 1932 R.W. Chapman edition http://www.gutenberg.org/ebooks/42671#bibrec donated by Distributed Proofreaders. #76 & #32325 are another example. It would be good to be able to link the various editions together.

External IDs

In prior chapters we talked about external sources of metadata. We should include ids for:

  • OpenLibrary

  • OCLC

  • Google Books

  • Wikipedia

  • Goodreads

  • LibraryThing

  • Unglue.it

  • ManyBooks

  • FeedBooks

More

I’m sure there are types of metadata I haven’t thought of! Suggestions and comment are both welcomed and needed.

In the next chapter, I’ll discuss target vocabularies for the metadata.

@sethwoodworth
Copy link

re: external ids Do all of these orgs have a unique ID for a book that is guaranteed not to change?

Either way, I would very much like an index of Librivox recordings that are related to a given work in gitenberg.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment