bnewbold

Do the whole thing:

PYTHONPATH='.' luigi --module small Small --local-scheduler

I've been implementing a dat client in Rust: https://github.com/bnewbold/geniza

It's been fun! The "whitepaper"/spec has been very helpful. Below are a few thoughts/comments on the paper, documentation, and protocols.

Informal Protocol Proposals

With my archival and inter-op hat on, I wish that the hyperdrive metadata register (specifically the Node protobuf messages) included standard full-file hashes (e.g., SHA1 or BLAKE2b of the entire file, with no length prefix). These could be optional, and could presumably be calculated with little overhead when adding files to archives. They would make auditing, verification, and interoperability between distributed networks easier. The storage and compute overhead would be small but non-zero.
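As a rough illustration of what such a full-file hash would be, here is a minimal Python sketch that streams a file once and computes both digests over the raw bytes, with no length prefix. The function name `full_file_hashes` and the chunk size are assumptions made for this example; nothing here is part of hyperdrive or the dat spec.

```python
import hashlib


def full_file_hashes(path, chunk_size=1024 * 1024):
    """Return hex SHA-1 and BLAKE2b digests of a file's raw bytes.

    Hypothetical helper for illustration only. The digests cover the
    file contents exactly as stored, with no length prefix.
    """
    sha1 = hashlib.sha1()
    blake2b = hashlib.blake2b()  # default 64-byte (512-bit) digest
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
            blake2b.update(chunk)
    return {"sha1": sha1.hexdigest(), "blake2b": blake2b.hexdigest()}
```

Both digests come out of a single read pass, which is why the extra cost at import time should be modest when the file is already being read for chunking.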

It seems like the network protocol really should have a version field... in the initial pre-encryption Register message?

RE: https://phabricator.wikimedia.org/T223528

The images of the National Library of Aruba are on the Internet Archive (https://archive.org/details/bibliotecanacionalaruba). Can you get their images from their site (currently 1,017) and put them (URLs) + all the metadata into OpenRefine or a CSV file so I can prepare them for upload to Wikimedia Commons?

First, you need the following command line tools installed:

Python 3
the internetarchive Python command-line tool ("ia"): https://github.com/jjjake/internetarchive
GNU Parallel ("parallel")
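Once item metadata has been fetched as JSON (for example with `ia metadata <identifier>`), turning it into CSV rows of file URLs plus metadata could look roughly like the sketch below. The helper names `item_to_rows` and `rows_to_csv` and the choice of columns are assumptions for illustration; the sketch relies only on the `metadata` and `files` keys of the item JSON and on the `https://archive.org/download/<identifier>/<filename>` URL pattern.

```python
import csv
import io
from urllib.parse import quote

FIELDS = ["identifier", "title", "file", "url"]  # hypothetical column choice


def item_to_rows(item):
    """Yield one row per file in a single item's metadata JSON."""
    identifier = item["metadata"]["identifier"]
    title = item["metadata"].get("title", "")
    for f in item.get("files", []):
        name = f["name"]
        yield {
            "identifier": identifier,
            "title": title,
            "file": name,
            # Percent-encode spaces etc.; "/" in file names is left as-is.
            "url": f"https://archive.org/download/{quote(identifier)}/{quote(name)}",
        }


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

This emits a row for every file in the item, including derivatives and metadata files; in practice one would probably filter on each file entry's `format` field to keep only the images.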

fixed_issnl	issnl	fatcat_ident	name
1530-1311	1550-1311	2nklacmgkjdfjib7kqi3hdh76a	International Symposium on Temporal Representation and Reasoning/TIME, Proceedings of the
2306-0441	2223-0441	rngvdeed65ffhmgfxti7s2z6by	Journal of Local and Global Health Science
1641-6554	1641-6565	yviicehmubf4bcxo23q43sbkzu	Kolposkopia
1526-7539	1526-7639	pnyefvclqfabjlnl4suox6pdte	International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Proceedings of the
0276-6574	0276-6547	hcrk2xeoknf7daxwurje2vg3n4	Computers in Cardiology Conference
0018-9413	0359-4237	ngthxcdwgzhovfgimtll6owdnm	IEEE Transactions on Geoscience Electronics
	2630-4301	drgmggxvfjbjjkakrjegaujkgy	Food Modelling Journal
1089-7771	1086-7771	sjjfcknh3zawndw6jdmvfrix7a	IEEE transactions on information technology in biomedicine
1093-1139	1039-1139	nnrvd2qmhzbebk2hsopvtchodq	Academic Physician and Scientist
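The table does not say how the bad ISSN-Ls were spotted. One cheap sanity check, sketched here as an assumption about how such errors could be caught rather than as the method actually used, is the standard ISSN check digit: weight the first seven digits by 8 down to 2, sum them, and compare the result modulo 11 against the final character. The function names are made up for this example.

```python
def issn_check_digit(first_seven):
    """Compute the ISSN check character for the first seven digits."""
    # Weights run 8, 7, 6, 5, 4, 3, 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def is_valid_issn(issn):
    """Return True if a string like '1530-1311' has a correct check digit."""
    digits = issn.replace("-", "").upper()
    if len(digits) != 8 or not digits[:7].isdigit():
        return False
    return issn_check_digit(digits[:7]) == digits[7]


# First row of the table above: the original value fails, the fix passes.
print(is_valid_issn("1550-1311"))  # False
print(is_valid_issn("1530-1311"))  # True
```

A passing check digit only shows that the string is well formed; it does not prove that the ISSN-L belongs to the named journal.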

You can experiment with GROBID affiliation extraction (raw affiliation string to structured object) at, e.g., http://grobid.qa.fatcat.wiki/. Go to "TEI", then select "Process Affiliations".

The GROBID dataset here comes from millions of research papers found on the web by Wayback crawlers. The structured affiliations were gathered into a dataset available at https://archive.org/details/ia_research_affiliation_datasets.

The bioRxiv dataset (of top 1000 institutional affiliations, mostly

This gist contains hacky Python scripts that pull registration content (as JSON) and any files and wiki pages into a directory structure. One then runs a bagit script and mixes in metadata to get a BagPack.

NOTE: the bagit script I used wasn't BagPack-aware, so it didn't actually include the files under ./metadata/ in the manifests. Also all the bagit metadata is just defaults; these examples are just to show the "shape" of the results.
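To show the kind of entries that were missing, here is a small Python sketch that builds checksum lines for everything under ./metadata/. It assumes those files count as BagIt tag files and would therefore be listed in a tag manifest (e.g. tagmanifest-sha256.txt) in the usual "checksum, whitespace, relative path" form; the function name `tag_manifest_lines` is made up for this example and is not part of any bagit tool.

```python
import hashlib
from pathlib import Path


def tag_manifest_lines(bag_dir, subdir="metadata", algorithm="sha256"):
    """Return manifest lines for every file under bag_dir/subdir.

    Hypothetical helper: assumes files in `subdir` are tag files that
    should appear in a tag manifest, with paths relative to the bag root.
    """
    bag_dir = Path(bag_dir)
    lines = []
    for path in sorted((bag_dir / subdir).rglob("*")):
        if path.is_file():
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            rel = path.relative_to(bag_dir).as_posix()
            lines.append(f"{digest}  {rel}")
    return lines
```

Writing the returned lines, joined with newlines, to the tag manifest at the bag root would make the metadata files verifiable alongside the payload.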

Check how example items show up on archive.org in this collection: https://archive.org/details/cos-dev-sandbox

Thoughts

The description HTML is probably going to link to any child registrations (items). It should also link back to at least the collection page.


	#
	# NOTE: This code is actually not used in PyX.jl right now, it's just here as
	# an example.
	#
	# This file contains a recursive version of the pywrap() function from PyCall:
	# it will generate nested Julia Modules for nested Python modules.
	# This extends to recursive or infinitely looping modules. For example, the
	# Python "os" module goes infinitely deep:
	#

	var hyperdrive = require('hyperdrive')
	var Dat = require('dat-node')
	var datHttp = require('dat-http')

	var storage = datHttp('https://static.bnewbold.net/tmp/dummy-dat/')

	console.log('Starting...')
	Dat(storage, function (err, dat) {
	  if (err) throw err
	  console.log('Writable?', dat.writable)

	#!/usr/bin/env python3

	"""
	Depends on:
	- articlemetaapi

	Refs:
	- https://github.com/scieloorg/articlemetaapi/blob/master/articlemeta/client.py
	- https://github.com/scieloorg/xylose/blob/master/xylose/scielodocument.py
	"""

	Monday March 23 french onion soup and butter miso carrots
	Tuesday March 24 pasta al pesto genovese e asparago
	Thursday April 2 Coronation chickpeas and yogurt tomatoes
	Friday April 3 queso fondido
	Sunday April 5 Sun-dried tomato almond pesto pasta
	Monday April 6 gua bao! (割包)
	Wednesday April 8 okonomiyaki
	Friday April 10 broccoli pesto pasta + burrata salad
	Sunday April 12 veggie reuben
	Thursday April 16 pipian pascal