
Feedback on the draft plan for Trove advanced research tools under the HASS Research Data Commons program

Submitted by Dr Tim Sherratt (GLAM Workbench and University of Canberra)

Do you think this proposal meets the requirements?

I’ve provided more detailed feedback on the proposal in this post.

Unfortunately I don’t think this proposal meets the requirements. A comparison with the other project plans is quite stark. While the other plans will establish collaborative processes, involving multiple partners, to deliver a range of outcomes, this plan proposes to develop a small number of products with minimal consultation. No justification is given for prioritising these products; no indication is provided as to how they might integrate with other research infrastructure; no strategy is outlined for sustaining these products; no explanation is offered as to how this plan could help build a framework for future infrastructure development and collaboration. My concern is that this will be a ‘one-off’ project that fails to connect with either the research community or existing infrastructure efforts. This risks setting back the current program of HASS research infrastructure development.

Given that the range of outputs is limited, I’d expect there to be more detail about what’s actually proposed, but there are a lot of gaps and uncertainties that make it difficult to assess.

The draft plan indicates that authentication of users through the Australian Access Federation is preferred. Why? Trove already has a system for the creation of user accounts. Using AAF would limit use of the new platform to those attached to universities or research agencies. This does not seem to align with FAIR data principles, and contradicts the NLA’s own professed aim to open such tools to a wider group of users.

Of key importance to researchers is the nature of the data that is being saved and visualised, but this is skipped over in the plan. When users ‘create’ a dataset, are they storing a timestamped version of the metadata and full text? This is essential for research purposes. When users visualise a search over time, are they visualising the number of search results, or the number of times a keyword appears? What geospatial data will be visualised, and how was it created? Researchers can’t take data for granted – they need to be able to engage critically with the processes through which it is generated, enriched, and analysed. The plan offers no space for this sort of engagement.

Disappointingly, the plan fails to make use of any existing tools, such as those provided through the GLAM Workbench. Researchers have been visualising newspaper searches over time for the past 10 years using versions of QueryPic. What does this plan add to what’s already available? The plan claims that it will create opportunities for non-technical users to engage with Trove data, but QueryPic, the Trove Newspaper Harvester, and other GLAM Workbench tools already exist as simple web apps. If these new tools are intended as an introduction for non-technical users, where are the pathways and integrations with other services that would enable users to develop their skills and build research workflows? Again, this seems like a standalone project that makes little effort to connect with either existing infrastructure or the needs of researchers.
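
To make the comparison concrete, here’s a rough sketch of what QueryPic-style charting involves with the existing v2 API: counting newspaper search results per year via facets. The API key and query are placeholders, and, as noted above, these are counts of matching articles, not keyword occurrences. (In the newspaper zone, the year facet needs an accompanying decade value.)

```python
import requests

API_KEY = "YOUR_TROVE_API_KEY"  # placeholder
API_URL = "https://api.trove.nla.gov.au/v2/result"

def results_per_year(query, decade):
    """Return {year: number of matching articles} for one decade.

    In the newspaper zone the v2 API's year facet only works with
    an accompanying l-decade value, e.g. '191' for 1910-1919.
    """
    params = {
        "q": query,
        "zone": "newspaper",
        "facet": "year",
        "l-decade": decade,
        "n": 0,  # facets only, no records
        "encoding": "json",
        "key": API_KEY,
    }
    data = requests.get(API_URL, params=params).json()
    terms = data["response"]["zone"][0]["facets"]["facet"]["term"]
    if isinstance(terms, dict):  # a single facet term isn't wrapped in a list
        terms = [terms]
    # Counts are per-year totals of matching articles, not keyword frequencies
    return {int(t["display"]): int(t["count"]) for t in terms}

counts = results_per_year("influenza", "191")  # the 1910s
```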

How would this proposal support your research?

It’s difficult to see how this proposal would add anything to what I am currently able to do with Trove data.

I can, for example, already create datasets from the Trove API using the Trove Newspaper Harvester. These datasets are generated using openly licensed tools running on a variety of cloud services. The datasets are timestamped, and can include the full text, PDFs, and images as well as article metadata. Other notebooks in the GLAM Workbench harvest data from Trove books and journals.
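
As an illustration (not the Harvester’s actual implementation), a timestamped metadata-plus-text harvest needs only a few lines against the v2 API. The key, query, and output filename below are placeholders:

```python
import json
from datetime import datetime, timezone

import requests

API_KEY = "YOUR_TROVE_API_KEY"  # placeholder
API_URL = "https://api.trove.nla.gov.au/v2/result"

def harvest_newspapers(query, outfile="harvest.ndjson"):
    """Save all matching newspaper articles (metadata plus OCRd text)
    as newline-delimited JSON, timestamping each record."""
    harvested_at = datetime.now(timezone.utc).isoformat()
    params = {
        "q": query,
        "zone": "newspaper",
        "include": "articletext",  # full text, not just metadata
        "bulkHarvest": "true",     # stable ordering for cursor-based paging
        "n": 100,
        "s": "*",                  # start cursor
        "encoding": "json",
        "key": API_KEY,
    }
    with open(outfile, "w") as f:
        while True:
            records = requests.get(API_URL, params=params).json()[
                "response"]["zone"][0]["records"]
            for article in records.get("article", []):
                article["harvested_at"] = harvested_at  # when this copy was made
                f.write(json.dumps(article) + "\n")
            if "nextStart" not in records:  # no cursor means last page
                break
            params["s"] = records["nextStart"]

harvest_newspapers('"Spanish flu"')
```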

I can share these datasets with either selected partners or the public using CloudStor. I already share several gigabytes of metadata, text, and images harvested from Trove via CloudStor.

I can visualise these datasets using a variety of notebooks in the GLAM Workbench, as well as other open source tools, such as Voyant. Most importantly, these tools offer the opportunity to ask critical questions of the data itself.
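
For instance, a few lines of pandas (with matplotlib installed) turn the harvest file sketched above into a publication-date chart, with the underlying records still open to inspection:

```python
import pandas as pd

# Load the timestamped harvest and chart articles by year of publication.
df = pd.read_json("harvest.ndjson", lines=True)
df["year"] = pd.to_datetime(df["date"]).dt.year
df["year"].value_counts().sort_index().plot(kind="bar")
```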

It’s not clear from the plan whether there will be any enrichment of the existing data. If the data was enriched, and made available through the public API, this would certainly support new research uses.

Are there other capability gaps that we should consider in the longer term?

I think the primary role of the NLA in the development of this research platform should be as the data provider. There are numerous ways in which Trove’s data might be improved and enriched in support of new research uses. These improvements could then be pushed through the API to integrate with a range of tools and resources. Which raises the question: where is the API in this plan?

The only mention of the API comes as an option for a user with ‘high technical expertise’ to extend the analysis provided by the built-in visualisations. This is all backwards. The API is the key pipeline for data sharing and integration and should be at the heart of this plan. This program offers an opportunity to make some much-needed improvements to the API. Here are a few possibilities:

  • Bring the web interface and API back into sync so that researchers can easily transfer queries between the two (Trove’s interface update introduced new categories, while the API still groups resources by the original zones).
  • Provide public API access to additional data about digitised items. For example, you can get lists of newspaper titles and issues from the API, but there’s no comparable method to get titles and issues for digitised periodicals. The data’s there – it’s used to generate lists of issues in the browse interface – but it’s not in the API. There’s also other resource metadata, such as parent/child relationships, which are embedded in web pages but not exposed in the API.
  • Standardise the delivery of OCRd text for different resource types.
  • Finally add the People & Organisations data to the main RESTful API.
  • Fix the limitations of the web archives CDX API (documented here); a query sketch follows this list.
  • Add a search API for the web archives.
  • Make use of IIIF standard APIs for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF compliant tools for integration, analysis, and annotation.
  • And what about a Write API? Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources.
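
On the CDX point above, here’s a sketch of a capture query against the Australian Web Archive. The endpoint and parameters follow pywb conventions as documented in the GLAM Workbench, but treat both as assumptions to verify:

```python
import json

import requests

# Ask the Australian Web Archive's CDX API for captures of a URL.
response = requests.get(
    "https://web.archive.org.au/awa/cdx",
    params={"url": "discontents.com.au", "output": "json"},
)
# pywb-style CDX APIs return one JSON object per line (NDJSON)
captures = [json.loads(line) for line in response.text.splitlines() if line]
print(len(captures), "captures")
```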

I think the HASS RDC would benefit greatly by thinking much more about the role of the Trove API in establishing reusable data flows, and connecting up components.
