@bitsgalore
Last active November 14, 2024 23:07

Experimental attempt at getting organized ...

07/11/2024

HTML Text fragments

https://developer.mozilla.org/en-US/docs/Web/URI/Fragment/Text_fragments

Text fragments allow linking directly to a specific portion of text in a web document, without requiring the author to annotate it with an ID, using particular syntax in the URL fragment.

Example:

https://www.bitsgalore.org/2024/10/30/jpeg-quality-estimation-using-simple-least-squares-matching-of-quantization-tables#:~:text=I%20originally%20wrote,heuristic.
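A minimal Python sketch of how such a link can be put together, assuming the start,end form of the text directive used in the example above. The function name and the example.org URL are made up for illustration:

```python
from typing import Optional
from urllib.parse import quote


def text_fragment_url(page_url: str, start: str, end: Optional[str] = None) -> str:
    """Build a link of the form page#:~:text=start[,end] (hypothetical helper)."""

    def encode(text: str) -> str:
        # Percent-encode everything; '-' is escaped as well because it has
        # a special meaning inside a text directive.
        return quote(text, safe="").replace("-", "%2D")

    directive = encode(start)
    if end is not None:
        directive += "," + encode(end)
    # Drop any fragment that is already present on the page URL.
    base = page_url.split("#", 1)[0]
    return f"{base}#:~:text={directive}"


link = text_fragment_url("https://example.org/post", "I originally wrote", "heuristic.")
print(link)
```

The browser highlights everything from the first match of the start text up to the end text, so both only need to be long enough to be unique on the page.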

Firefox add-on that automatically creates a text fragment link for selected text:

https://addons.mozilla.org/en-US/firefox/addon/text-fragment/

Stirling PDF

Your locally hosted one-stop-shop for all your PDF needs.

https://stirlingpdf.io/

03/11/2024

Docling

  • Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
  • Advanced PDF document understanding incl. page layout, reading order & table structures

https://ds4sd.github.io/docling/

27/10/2024

Image compression test images

These high-resolution high-precision images have been carefully selected to aid in image compression research and algorithm evaluation. These are photographic images chosen to come from a wide variety of sources and each one picked to stress different aspects of algorithms. Images are available in 8-bit, 16-bit and 16-bit linear variations, RGB and gray.

https://imagecompression.info/test_images/

samplelib

This is a collection of sample files published on https://samplelib.com for easier access & usage.

https://github.com/ffeast/samplelib

11/10/2024

Preserving Digital Art

DPC Technology Watch Guidance Note (2024):

http://doi.org/10.7207/twgn24-02

19/09/2024

Unpack .zst archive

unzstd mingw-w64-x86_64-poppler-24.08.0-1-any.pkg.tar.zst

05/09/2024

Scoop

A command-line installer for Windows:

https://scoop.sh/

How to install poppler on Windows10 ? (step by step)

oschwartz10612/poppler-windows#42

Build + install python-poppler on Windows (conda)

cbrunet/python-poppler#9 (comment)

This fails for me in the last step (pip install python-poppler) with a PermissionError on the line:

hp, ht, pid, tid = _winapi.CreateProcess(executable, args,

This also triggers an "action blocked" notification from the Windows Security application, which seems to block launching an executable.

26/08/2024

Emulating the early Macintosh floppy drive

https://thomasw.dev/post/mac-floppy-emu/

21/08/2024

docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

  • ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
  • 📑 Understands detailed page layout, reading order and recovers table structures
  • 📝 Extracts metadata from the document, such as title, authors, references and language
  • 🔍 Optionally applies OCR (use with scanned PDFs)

https://github.com/DS4SD/docling

23/07/2024

FileTrove

FileTrove indexes files and creates metadata from them.

https://github.com/steffenfritz/FileTrove

19/07/2024

Use Markdown formatting in Microsoft Teams

https://support.microsoft.com/en-us/office/use-markdown-formatting-in-microsoft-teams-4d10bd65-55e2-4b2d-a1f3-2bebdcd2c772

Actually this doesn't seem to work at all (except for hyperlinks)!

18/07/2024

6 Steps Towards Reproducible Research

A short book with 6 steps that get you closer to making your work reproducible.

https://zenodo.org/records/12744715

08/07/2024

The Research Data Management Workbook

The Research Data Management Workbook is made up of a collection of exercises for researchers to improve their data management.

https://caltechlibrary.github.io/RDMworkbook/

03/07/2024

Horrifying PDF experiments

https://github.com/osnr/horrifying-pdf-experiments

02/07/2024

diff-pdf

Diff-pdf is a tool for visually comparing two PDFs:

https://github.com/vslavik/diff-pdf

14/06/2024

Version Control of Code and Data

The purpose of this book is to empower scientists, researchers, and students with the knowledge and skills needed to use Git for version control of code and data.

https://lennartwittkuhn.com/version-control-book/

11/06/2024

Disentangling Digital Preservation Risk - An Interdisciplinary Exploration and Solution

PhD thesis by Maureen Pennock:

https://discovery.dundee.ac.uk/en/studentTheses/disentangling-digital-preservation-risk

https://discovery.dundee.ac.uk/ws/portalfiles/portal/118389901/Pennock_THESIS-DigitalPreservationRisk.pdf

05/06/2024

Mastodon preview cards

This explains the setup on a Jekyll site:

https://amytabb.com/til/2022/12/03/mastodon-preview-cards/

https://docs.joinmastodon.org/entities/PreviewCard/

Open Graph protocol:

https://ogp.me/

Also:

Creating Twitter cards on Jekyll websites

03/06/2024

XHTML in EPUB

Section 6.1.2 of EPUB 3.3 spec:

An XHTML content document:

MUST be an [html] document that conforms to the XML syntax.

Referenced section in HTML spec (14 The XML syntax) shows this warning:

Using the XML syntax is not recommended, for reasons which include the fact that there is no specification which defines the rules for how an XML parser must map a string of bytes or characters into a Document object, as well as the fact that the XML syntax is essentially unmaintained — in that, it’s not expected that any further features will ever be added to the XML syntax (even when such features have been added to the HTML syntax).

Consequences for future of EPUB?

29/05/2024

Nanosearch, a Python package for making small search engines

https://jamesg.blog/2024/05/29/nanosearch/

27/05/2024

DWSampleFiles

DWSampleFiles.com provides plethora of different files types and extensions. Download files for testing purposes in many different sizes, bitrates, or resolutions

https://www.dwsamplefiles.com/

22/05/2024

What's so hard about PDF text extraction?

https://web.archive.org/web/20230107081641/https://filingdb.com/b/pdf-text-extraction

18/05/2024

hypercyclic

LFO-driven, midi-mangling arpeggiator:

https://www.mucoder.net/en/hypercyclic/

qmidiarp

https://qmidiarp.sourceforge.net/

Installation:

sudo apt install qmidiarp

NIH-plug

NIH-plug is an API-agnostic audio plugin framework written in Rust, as well as a small collection of plugins.

https://github.com/robbert-vdh/nih-plug/

MIDI example:

https://github.com/robbert-vdh/nih-plug/blob/master/plugins/examples/midi_inverter/src/lib.rs

PyPhonic

VST plugin that enables using Python to process MIDI and audio in the DAW (VST not yet released; only server code):

https://github.com/AudioFluff/PyPhonic

Docs:

https://audiofluff.github.io/PyPhonic/

17/05/2024

Does One Line Fix Google?

https://tedium.co/2024/05/17/google-web-search-make-default/

Accents and eBooks (Unicode)

[M]odern computing gives us two main ways of displaying a letter with an accent. The first is simple - encode every single accented letter as a separate "pre-composed" character. (...)

[T]here is a second way to add accents. You take the base character (...) and then apply a separate "combining" accent character to it.

https://shkspr.mobi/blog/2024/05/accents-and-ebooks/
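Quick illustration of the two forms in Python, using the standard unicodedata module. The letter é is just an example I picked; it is not taken from the post:

```python
import unicodedata

precomposed = "\u00e9"   # é as one pre-composed code point
combining = "e\u0301"    # plain e followed by COMBINING ACUTE ACCENT

# Both render the same, but they are different strings.
print(precomposed == combining)

# Normalisation converts between the two forms.
nfc = unicodedata.normalize("NFC", combining)    # composed form
nfd = unicodedata.normalize("NFD", precomposed)  # decomposed form
print(len(nfc), len(nfd))
```

Normalising both sides to the same form (NFC or NFD) before comparing strings or file names avoids mismatches between the two representations.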

06/05/2024

PDF 2.0 adds glTF model support

A new ISO extension to PDF 2.0 adds PDF support for the Khronos Group’s glTF 3D format. (...) PDF 2.0 therefore now supports four 3D formats:

https://pdfa.org/pdf-2-0-adds-gltf-model-support/

04/05/2024

TSAC: Very Low Bitrate Audio Compression

https://bellard.org/tsac/

01/05/2024

Big List of Naughty Files

Script to generate troublesome filenames from the big list of naughty strings

https://github.com/ross-spencer/big-list-of-naughty-files

29/04/2024

Just the Docs

A modern, highly customizable, and responsive Jekyll theme for documentation with built-in search. Easily hosted on GitHub Pages with few dependencies.

https://github.com/just-the-docs/just-the-docs

24/04/2024

subprocess.run vs Popen

subprocess.run() is synchronous which means that the system will wait till it finishes before moving on to the next command. subprocess.Popen() does the same thing but it is asynchronous (the system will not wait for it to finish).

Source:

https://stackoverflow.com/a/71896704/1209004
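Small sketch of the difference. The child command (a Python one-liner) is only a stand-in so the example runs anywhere:

```python
import subprocess
import sys

# Stand-in child process: a Python one-liner that prints a word.
cmd = [sys.executable, "-c", "print('done')"]

# run() blocks until the child exits, then returns a CompletedProcess.
completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
run_output = completed.stdout.strip()

# Popen() returns immediately; the child keeps running in the background.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
# ... other work could be done here while the child runs ...
popen_stdout, _ = proc.communicate()  # wait for the child and collect its output
popen_output = popen_stdout.strip()

print(run_output, popen_output, proc.returncode)
```

With Popen the waiting is explicit (communicate() or wait()), which is what makes it usable for running several processes side by side.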

10/04/2024

Analyzing Malicious Documents

Cheat sheet that covers tools, common commands, and other information for analyzing malicious documents, such as Word, OneNote and PDF:

https://www.thecyberyeti.com/_files/ugd/b84265_d6d2f6486f6b41419aa9f1cd34027392.pdf

08/04/2024

nimbie-py

Python driver for acronova's nimbie NB21:

https://github.com/mattsoulanille/nimbie-py

06/04/2024

The Mediocre Programmer

https://themediocreprogrammer.com/

29/03/2024

ARIA

ARIA is an online tool that enables users and building managers to assess the risk of SARS-COV-2 (COVID-19) airborne transmission in residential, public, and healthcare settings. The aim is to inform decisions that can significantly reduce the risk of transmission.

https://partnersplatform.who.int/tools/aria

28/03/2024

How to make Windows 11 more usable, less annoying

https://www.dedoimedo.com/computers/windows-11-usability-guide.html

Particularly interesting - Open-Shell, "a collection of utilities bringing back classic features to Windows" (including Start Menu!):

https://github.com/Open-Shell/Open-Shell-Menu

Okay, Color Spaces

What is a “color space?”

https://ericportis.com/posts/2024/okay-color-spaces/

SSIMULACRA 2 - Structural SIMilarity Unveiling Local And Compression Related Artifacts

Perceptual image quality metric developed by Jon Sneyers (Cloudinary):

https://github.com/cloudinary/ssimulacra2

27/03/2024

Discmaster

Experimental website to browse and search vintage computer files from archive.org:

https://discmaster.textfiles.com/

26/03/2024

Inside PDF

https://news.speedata.de/2024/03/19/insidepdf-01/

25/03/2024

OPF File Format Risk Registry (archive.org)

http://web.archive.org/web/20231002013530/https://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry

23/03/2024

Sync microsoft office calendar in thunderbird

https://support.mozilla.org/gu-IN/questions/1363441#answer-1471948

21/03/2024

Fdupes

https://github.com/adrianlopezroche/fdupes

Duc

Duc is a collection of tools for inspecting and visualizing disk usage.

https://duc.zevv.nl/

20/03/2024

Digital preservation policies and strategies

https://github.com/digipres/policies/

09/03/2024

ImageMagick "cache resources exhausted error" fix

Processing some largish images, ImageMagick would fail with this error:

cache resources exhausted

A quick search led me to this StackOverflow thread, which explains how it is related to settings in ImageMagick's security policy file. On my machine this is located at /etc/ImageMagick-6/policy.xml. In this file you can set limits to the resources (e.g. memory) ImageMagick is allowed to use, and the defaults are very restrictive. After some fiddling, I managed to make things work by setting the values below:

  <policy domain="resource" name="memory" value="4GiB"/>
  <policy domain="resource" name="map" value="4GiB"/>
  <policy domain="resource" name="width" value="32KP"/>
  <policy domain="resource" name="height" value="32KP"/>
  <policy domain="resource" name="area" value="1GiB"/>
  <policy domain="resource" name="disk" value="4GiB"/>
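To check which limits are currently in effect, the resource entries can be pulled out of the policy file with a few lines of Python. This is only a sketch: the function name is made up, and the sample string stands in for the contents of the real policy.xml:

```python
import xml.etree.ElementTree as ET


def resource_limits(policy_xml: str) -> dict:
    """Map each resource policy name to its value string (hypothetical helper)."""
    root = ET.fromstring(policy_xml)
    return {
        policy.get("name"): policy.get("value")
        for policy in root.iter("policy")
        if policy.get("domain") == "resource"
    }


# Stand-in for the contents of the policy file.
sample = """<policymap>
  <policy domain="resource" name="memory" value="4GiB"/>
  <policy domain="resource" name="disk" value="4GiB"/>
  <policy domain="coder" rights="none" pattern="PS"/>
</policymap>"""

limits = resource_limits(sample)
print(limits)
```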

07/03/2024

Artist_Exhibition-copy (FINAL)(2).mov: Preserving diacritics in filenames as significant properties in media conservation

https://bits.ashleyblewer.com/blog/2019/06/17/artist-exibition-copy-final-2-preserving-diacritics-in-filenames-as-significant-properties-in-media-conservation/

(Includes a useful list of tools at the bottom of the post)

04/03/2024

Bash errors (Julia Evans)

https://wizardzines.com/comics/bash-errors/

02/03/2024

Activate NL-Alert on Motorola c139 dumbphone

  • Select "Messages"
  • Select "info services"
  • Go to settings
  • Activate "service" setting
  • Go to "active channels"
  • Add new channel, set value to 919 (source)

22/02/2024

iPRES Index

https://lite.datasette.io/?url=https%3A%2F%2Fraw.githubusercontent.com%2Fanjackson%2Fdigipres-practice-index%2Fmain%2Freleases%2Fpractice-v0.1.0.db#/practice-v0~2E1/publications?_sort=year&_facet=year&_facet=license&_facet=language&_facet=type&_facet_array=keywords&_facet_array=creators&_facet_array=institutions&_facet_size=8

21/02/2024

Voorkeursformaten Overheid

https://www.nationaalarchief.nl/archiveren/kennisbank/voorkeursformaten-overheid

16/02/2024

Magika: AI powered fast and efficient file type identification

https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html

Read/write MS Office formats with Python

Python-docx (docx):

https://python-docx.readthedocs.io/

Python-pptx (pptx):

https://python-pptx.readthedocs.io/

Openpyxl (xlsx):

https://openpyxl.readthedocs.io/

IETF language tags

https://en.wikipedia.org/wiki/IETF_language_tag

IANA Language Subtag Registry

http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

10/02/2024

Layout parser

With the help of state-of-the-art deep learning models, Layout Parser enables extracting complicated document structures using only several lines of code. This method is also more robust and generalizable as no sophisticated rules are involved in this process.

https://layout-parser.github.io/

Stract search engine

Stract is an open source search engine where the user has the ability to see exactly what is going on and customize almost everything about their search results. It's a search engine made for hackers and tinkerers just like ourselves.

https://stract.com/

load xml files into SQLite and transform to json

https://gist.github.com/atomotic/8cdb9f233136eea2ad507bb6940c5c8e

Source: Mastodon.

09/02/2024

The Cost of a Digital Archive

https://lil.law.harvard.edu/blog/2024/02/08/the-cost-of-a-digital-archive/

08/02/2024

READMINE: Suggested template for software READMEs

https://github.com/mhucka/readmine

06/02/2024

Safelinks are a fragile foundation for publishing

https://shkspr.mobi/blog/2024/02/safelinks-are-a-fragile-foundation-for-publishing/

05/02/2024

New GitHub Copilot Research Finds 'Downward Pressure on Code Quality'

https://visualstudiomagazine.com/articles/2024/01/25/copilot-research.aspx

Direct link to whitepaper:

https://gitclear-public.s3.us-west-2.amazonaws.com/Coding-on-Copilot-2024-Developer-Research.pdf

03/02/2024

International Comparison of Recommended File Formats

https://docs.google.com/spreadsheets/d/1XjEjFBCGF3N1spNZc1y0DG8_Uyw18uG2j8V2bsQdYjk/edit#gid=893099148

27/01/2024

fritz.box domain issue

Explained here.

Workaround - added the following line to /etc/hosts (this should prevent the live domain from ever being reached):

192.168.178.1	fritz.box

Google search console

Search Console tools and reports help you measure your site's Search traffic and performance, fix issues, and make your site shine in Google Search results

https://search.google.com/search-console?resource_id=https://www.bitsgalore.org/

26/01/2024

Portable EPUBs

This post explores what prevents HTML documents from being portable, and I propose a way forward based on the EPUB format.

https://willcrichton.net/notes/portable-epubs/#epub-content%2FEPUB%2Findex.xhtml$

Preserving iPad apps

https://www.dpconline.org/blog/blog-purnell-ipad-apps

21/01/2024

Modern Plain Text Computing

Over the six weeks of this mini-seminar we will learn some elements of plain-text computing that every graduate student in the social sciences (and beyond!) should know something about.

https://mptc.io/

Digital Production Fundamentals

https://github.com/jmaxsfu/pub607-23

18/01/2024

GPT in 500 lines of SQL

https://explainextended.com/2023/12/31/happy-new-year-15/

A list of known AI agents on the internet

Insight into the hidden ecosystem of autonomous chatbots and data scrapers crawling across the web. Protect your website from unwanted AI agent access.

https://darkvisitors.com/

robots.txt example:

https://darkvisitors.com/robots-txt-builder

17/01/2024

Punycode

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames.

https://en.wikipedia.org/wiki/Punycode
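Python ships codecs for this, so a quick round trip is easy to try. The host name below is an arbitrary example of mine:

```python
label = "m\u00fcnchen"  # "münchen"

# Raw Punycode encoding of a single label.
puny = label.encode("punycode").decode("ascii")

# The "idna" codec adds the xn-- prefix used in actual host names.
host = "m\u00fcnchen.example".encode("idna").decode("ascii")

# Decoding gets the original Unicode label back.
back = puny.encode("ascii").decode("punycode")

print(puny, host, back)
```

Note that the built-in "idna" codec follows the older IDNA 2003 rules, so results may differ from current registries for some characters.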

11/01/2024

Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files

By Elvia Arroyo-Ramirez, addresses, amongst other things, ethical aspects of "sanitizing" file names (original article now semi-paywalled):

https://medium.com/on-archivy/invisible-defaults-and-perceived-limitations-processing-the-juan-gelman-files-4187fdd36759

28/12/2023

Faircamp

A static site generator for audio producers

https://simonrepp.com/faircamp/

26/12/2023

MicroVerb III guide

Includes useful table that maps presets to BPM:

http://notebook.zoeblade.com/MicroVerb_III_guide.html

18/12/2023

The Electronic Connector Book

Includes interactive utilities for identifying and selecting connectors:

https://connectorbook.com/

06/12/2023

Bypass YouTube Anti Ad Block Detection (uBlock Origin)

  1. Open uBlock Origin dashboard
  2. Click "Filter Lists"
  3. Click "Purge all caches"
  4. Click "Update now"
  5. Use this site to check if uBlock Origin bypasses the latest YouTube anti-adblock script

Source: https://www.youtube.com/watch?v=Bhy66w5nVK0

10/11/2023

linux-packages.com

Info on packages for Linux and Unix:

https://linux-packages.com/

Example:

https://linux-packages.com/linux-mint-20-3/package/python3-jpylyzer

16/10/2023

Wikidata: most commonly supported file formats

This query shows a bar chart of the 1000 file formats which have the highest number of supporting applications (applications that can read data in this format).

https://www.wikidata.org/wiki/User:Dipsode87#Most_commonly_supported_file_formats

10/10/2023

The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)

Follow-up to Joel Spolsky's classic post:

https://tonsky.me/blog/unicode/

09/10/2023

Microsoft Graveyard - Killed by Microsoft

A full list of dead products killed by Microsoft in the Microsoft Cemetery:

https://killedbymicrosoft.info/

28/09/2023

pdf-differences

The PDF files in this repository are targeted test files highlighting specific issues seen across multiple widely-used implementations.

https://github.com/pdf-association/pdf-differences

14/09/2023

Pagefind

Pagefind is a fully static search library that aims to perform well on large sites, while using as little of your users’ bandwidth as possible, and without hosting any infrastructure.

Pagefind runs after Hugo, Eleventy, Jekyll, Next, Astro, SvelteKit, or any other website framework.

https://pagefind.app/

07/09/2023

Managing packages that are installed from source

How to uninstall/remove:

https://askubuntu.com/questions/87111/if-i-build-a-package-from-source-how-can-i-uninstall-or-remove-completely

Mentions the CheckInstall tool, which wraps around make install and keeps track of every file modified by this installation. Instructions here:

https://askubuntu.com/a/1278739/1052776

Also, for CMake-installed applications there should be a file install_manifest.txt in the build dir which lists all installed files.
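Sketch for reviewing such a manifest before removing anything, assuming it holds one absolute path per line. The function name and sample paths are hypothetical:

```python
def parse_manifest(text: str) -> list:
    """Return the paths listed in an install manifest, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Stand-in for the contents of build/install_manifest.txt.
sample = "/usr/local/bin/mytool\n/usr/local/share/man/man1/mytool.1\n"

paths = parse_manifest(sample)
for path in paths:
    # Dry run: only print what would be removed.
    print("would remove:", path)
```

Keeping this as a dry run first seems wise, since the listed files typically live in system directories.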

06/09/2023

SciDraw

SciDraw is a free repository of high quality drawings of animals, scientific setups, and anything that might be useful for scientific presentations and posters.

https://scidraw.io/

05/09/2023

pdfgrep

A commandline utility to search text in PDF files:

https://pdfgrep.org/

04/09/2023

PDF Cheat Sheets

To help developers whose relationship with PDF’s specification is casual or tangential, the PDF Association provides free PDF “cheat sheets” to aid in remembering key terms and concepts without constantly referring to ISO 32000.

https://pdfa.org/resource/pdf-cheat-sheets/

30/08/2023

High Throughput JPEG2000

Standard (incl. JPH format, which looks largely identical to JP2 + codestream, with some minor deviations):

https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.814-201906-I!!PDF-E&type=items

Evaluating HTJ2K as a Drop-In Replacement for JPEG2000 with IIIF:

https://journal.code4lib.org/articles/17596

29/08/2023

FFmpeg Explorer

A tool to help you explore FFmpeg filters.

https://ffmpeg.lav.io/

17/08/2023

Adobe and Microsoft break some old files by removing PostScript font support

https://arstechnica.com/gadgets/2023/08/microsoft-adobe-and-others-have-dropped-support-for-old-postscript-fonts/

Acrobat not affected as per below statement by Adobe:

https://helpx.adobe.com/fonts/kb/postscript-type-1-fonts-end-of-support.html

16/08/2023

Wavacity

Online version of Audacity:

https://wavacity.com/

Use ImageMagick's built-in test image

Use wizard: (note colon!) as input

convert -quality 40 wizard: wizard-40.jpg

28/07/2023

JACK audio server doesn't run

Error message:

ATTENTION: The playback device "hw:USB" is already in use. Please stop the application using it and run JACK again
cannot load driver module alsa
no message buffer overruns

Works again after this (source):

systemctl --user stop pulseaudio.socket
systemctl --user stop pulseaudio.service

26/07/2023

NapierOne: A Modern Mixed File Data Set

https://github.com/simonrdavies/NapierOne

Data:

http://napierone.com/Website/index.html

20/07/2023

Modern CSV

A Sophisticated CSV Editor/Viewer for Windows, Mac, and Linux

https://www.moderncsv.com/

19/07/2023

Introduction to the command-line interface

https://tutorial.djangogirls.org/en/intro_to_command_line/

05/07/2023

Syncthing

Syncthing is a continuous file synchronization program. It synchronizes files between two or more computers in real time, safely protected from prying eyes.

https://syncthing.net/

Study on quality in 3D digitisation of tangible cultural heritage

https://digital-strategy.ec.europa.eu/en/library/study-quality-3d-digitisation-tangible-cultural-heritage

04/07/2023

Rclone

Rclone ("rsync for cloud storage") is a command-line program to sync files and directories to and from different cloud storage providers.

https://github.com/rclone/rclone

Post by Andy Jackson on Rclone:

https://anjackson.net/2023/07/04/robust-file-transfers-with-rclone/

14/06/2023

Code of Practice for Indoor Air Quality

Health & Safety Authority (IE):

https://www.hsa.ie/eng/publications_and_forms/publications/latest_publications/code_of_practice_for_indoor_air_quality.104818.shortcut.html

31/05/2023

Plain list of all files/directories without attributes

ls -1

Result:

files-origin.md
pdf-hul-106
pdf-hul-109
pdf-hul-133
pdf-hul-137
pdf-hul-138
pdf-hul-154
pdf-hul-36
pdf-hul-4

26/05/2023

What is the point? The motivation for adopting different tools inside the #digitalpreservation workflow…

https://openpreservation.org/blogs/what-is-the-point-the-motivation-for-adopting-different-tools-inside-the-digitalpreservation-workflow/

23/05/2023

Glyph Positions Break PDF Text Redaction

  • Subpixel-sized horizontal shifts in redacted and non-redacted characters can be recovered and used to effectively deredact first and last names.

  • Majority of PDF redaction software tool-kits do not defend against these glyph displacement attacks.

  • In general, redacting a name from a PDF is not secure.

https://arxiv.org/pdf/2206.02285.pdf

16/05/2023

CC-MAIN-2021-31-PDF-UNTRUNCATED

This corpus contains nearly 8 million PDFs gathered from across the web in July/August of 2021. The PDF files were initially identified by Common Crawl as part of their July/August 2021 crawl (identified as CC-MAIN-2021-31) and subsequently updated and collated as part of the DARPA SafeDocs program.

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

11/04/2023

Get value of XML element (bash)

Example below returns all text values wrapped inside "myElement" elements (also works with namespaces):

xmllint --xpath "//*[local-name()='myElement']/text()" myfile.xml

Result:

myValue
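Rough Python counterpart using the standard library, in case xmllint isn't available. The function name is made up, and unlike the XPath text() call it only returns the text that directly follows each opening tag:

```python
import xml.etree.ElementTree as ET


def element_texts(xml_text: str, local_name: str) -> list:
    """Collect the text of all elements with this local name, in any namespace."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter():
        # ElementTree stores namespaced tags as "{uri}name".
        if elem.tag.rsplit("}", 1)[-1] == local_name:
            values.append(elem.text or "")
    return values


sample = (
    '<root xmlns:x="urn:example">'
    "<x:myElement>myValue</x:myElement>"
    "<myElement>other</myElement>"
    "</root>"
)
print(element_texts(sample, "myElement"))
```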

31/03/2023

Semantic File Inspector

This software analyzes the formats of given files and outputs RDF description of their contents.

https://sfi.is4.site/

13/03/2023

Yt-dlp

Fork of youtube-dl (the original youtube-dl is still maintained, but changes haven't resulted in updated releases for a long time):

https://github.com/yt-dlp/yt-dlp

Convert to audio (FLAC, highest quality) and discard video:

yt-dlp -x --audio-format flac --audio-quality 0 https://www.youtube.com/watch?v=xxxxxxxxxxx

08/03/2023

How to corrupt an archive file in a controlled way?

https://unix.stackexchange.com/questions/222359/how-to-corrupt-an-archive-file-in-a-controlled-way

28/02/2023

Disable OCR in Tika text extraction

By default Apache Tika applies OCR to images during text extraction if Tesseract is installed. This can be disabled in the tika.xml config file:

https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

Location of config file can be given as argument:

https://tika.apache.org/1.9/configuring.html#Using_a_Tika_Configuration_XML_file

So to make this work we create a file "tika-config.xml" with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

Then in Tika use the --config option to set the path to this file:

java -jar ~/tika/tika-app-2.6.0.jar --config=tika-config.xml -t whatever.epub  > whatever.txt

An alternative option is to uninstall Tesseract.

22/02/2023

Teleac - Effectief pc gebruik 1987

https://youtu.be/-ukSo5ulyQQ

21/02/2023

Page numbers aren't the answer

https://shkspr.mobi/blog/2023/02/page-numbers-arent-the-answer/

17/02/2023

Microsoft to demo its new ChatGPT-like AI in Word, PowerPoint, and Outlook soon

https://www.theverge.com/2023/2/10/23593980/microsoft-bing-chatgpt-ai-teams-outlook-integration

15/02/2023

Establish JPEG quality with ImageMagick

identify -format '%Q\n' /ecur-001.jpg

Result:

92

All files in directory, result to CSV file:

identify -format '%f,%Q\n' ./images-BKT/* > BKT-quality.csv

Source:

https://stackoverflow.com/a/18378080/1209004

ImageMagick percent escapes:

https://imagemagick.org/script/escape.php

08/02/2023

Thorium browser

Chromium fork for Linux, MacOS, Raspberry Pi, and Windows named after radioactive element No. 90.

https://thorium.rocks/

07/02/2023

Missing name of embedded file in docx document

https://issues.apache.org/jira/browse/TIKA-3968

01/02/2023

A Guide to Python’s Virtual Environments

https://towardsdatascience.com/virtual-environments-104c62d48c54

28/01/2023

Modeling and Simulation in Python

Use Computation to Predict and Explain the World by Allen B. Downey

https://nostarch.com/modeling-and-simulation-python

https://allendowney.github.io/ModSimPy/

26/01/2023

DRM Library

Here you'll find my various research, notes, and random information about various kinds of DRM, and DRM tangential information.

https://github.com/TheRogueArchivist/DRML

Are Video Codecs... Done? by Derek Buitenhuis

Does anyone under the age of 50 work on codecs anymore? Have we made the barrier to entry so high that you need to spend 10 years banging your head against esoteric papers to understand everything in VVC? Are we all doomed to glue things together?

https://youtu.be/3zaq56QsX28

19/01/2023

pdfCop

pdfCop is a compiled/to-be-compiled Java project based on an ANTLR4 grammar file that describes how Content Streams are structured as per the PDF specification. pdfCop can tell you whether a content stream, a PDF file, or a snippet follows the specification or not and it will let you know where the provided syntax did go wrong.

https://github.com/itext/pdfcop

17/01/2023

Hello, PNG!

I'm writing this article to fulfil my role as a PNG evangelist, spreading the joy of good-enough lossless image compression to every corner of the internet. Similar articles already exist, but this one is mine.

https://www.da.vidbuchanan.co.uk/blog/hello-png.html

09/01/2023

How to use LaTeX with MyST Markdown

The MyST Tools project, https://myst.tools, includes a command line interface for creating websites, scientific articles, and parsing markdown, notebooks, JATS, and now also can parse and render LaTeX directly! 🎉

https://curvenote.com/blog/how-to-use-latex-with-myst-markdown

07/01/2023

Uncurled

Uncurled – everything I know and learned about running and maintaining Open Source projects for three decades.

https://github.com/bagder/uncurled

13/12/2022

Standard Ebooks

Standard Ebooks takes ebooks from sources like Project Gutenberg, formats and typesets them using a carefully designed and professional-grade style manual, fully proofreads and corrects them, and then builds them to create a new edition that takes advantage of state-of-the-art ereader and browser technology.

https://standardebooks.org/

Thunderbird - checking for new messages in IMAP folders other than Inbox

http://kb.mozillazine.org/Checking_for_new_messages_in_other_folders_-_Thunderbird

06/12/2022

Test if file is UTF-8

Use isutf8 tool from moreutils (apt-get install moreutils):

isutf8 foo.txt

Result:

foo.txt: line 7, char 4, byte 580: Expecting bytes in the following ranges: 00..7F C2..F4.
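The same check is easy to do in Python when moreutils isn't installed. A minimal sketch with a made-up function name; it only reports the byte offset of the first problem, not the line and character position that isutf8 gives:

```python
def find_invalid_utf8(data: bytes):
    """Return None if data is valid UTF-8, else the byte offset of the first error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


print(find_invalid_utf8("caf\u00e9".encode("utf-8")))  # valid input
print(find_invalid_utf8(b"abc\xffdef"))                # 0xFF is never valid in UTF-8
```

For a file, pass in the result of open(path, "rb").read().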

Data Organization in Spreadsheets

[T]his article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.

https://www.tandfonline.com/doi/pdf/10.1080/00031305.2017.1375989

04/12/2022

New Leaves: Riffling the History of Digital Pagination

This article charts a fresh history of the development of digital pagination through a revisionist interrogation of three interrelated phenomena: 1. That digital pages do not behave as do their physical correlates but instead mimic earlier historical forms of print that fused pagination, scrolling, and the tablet form. 2. That the development of PDF was almost abandoned by Adobe’s board of directors, who could see no audience for it. 3. That there are other more robust lineages of constraint for digital pages from cinema and television.

https://eprints.bbk.ac.uk/id/eprint/43860/

Ten Simple Rules for Digital Data Storage

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097

03/12/2022

Flow - online EPUB reader

https://www.flowoss.com/

Coqui-ai Text to Speech

https://github.com/coqui-ai/TTS

Quire

Quire is an open-source multiformat publishing tool designed for longevity, discoverability, and scholarship. Using a single set of plain text files, Quire creates books as authoritative and enduring as print and as vibrant and feature-rich as the web—all without paying a fee or maintaining a complicated server.

https://quire.getty.edu/

30/11/2022

PDF 2.0 Application Note 002: Associated Files

This document provides background to the dictionaries and other entries that define Associated Files in PDF 2.0. As such, it is intended for developers who want to learn about Associated Files in PDF, and how they can improve interoperability of content beyond the exchange of digital paper

https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf

27/11/2022

Bash errors

Add the following line at the start of a Bash script to stop it from continuing after errors:

set -euo pipefail

Source:

https://wizardzines.com/comics/bash-errors/

25/11/2022

DNS DMARC settings

https://wpmailsmtp.com/how-to-create-dmarc-record/

(Note to self: added record for main mail domain 25-1-2022)

DMARC reports

https://support.google.com/a/answer/10032472?hl=en

DKIM record

https://support.dnsimple.com/articles/dkim-record/

Added 20 January (using domain key from ISP).

SPF record

https://www.cloudflare.com/learning/dns/dns-records/dns-spf-record/

Update 20-1-2023: combined 3 existing SPF records (not allowed) into one single record, as per:

https://wpmailsmtp.com/fix-multiple-spf-records/
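
Rough sketch of what combining the records amounts to, in Python. The record values in the usage example are made up (not my real ones), and `merge_spf` is just a name picked for illustration. It assumes all records can share one "all" term and keeps the first one it finds:

```python
def merge_spf(records):
    """Merge several SPF TXT record values into one.

    Keeps each mechanism once, in first-seen order, and ends with a single
    "all" term (the first one found, if any).
    """
    mechanisms = []
    all_term = None
    for record in records:
        terms = record.split()
        if not terms or terms[0].lower() != "v=spf1":
            raise ValueError("not an SPF record: " + record)
        for term in terms[1:]:
            # "all" may carry a qualifier (+, -, ~ or ?)
            if term.lstrip("+-~?").lower() == "all":
                all_term = all_term or term
            elif term not in mechanisms:
                mechanisms.append(term)
    return " ".join(["v=spf1"] + mechanisms + ([all_term] if all_term else []))
```

Usage: `merge_spf(["v=spf1 include:_spf.example.com ~all", "v=spf1 ip4:192.0.2.10 ~all"])` gives one record with both mechanisms and a single `~all`. Always check the merged result with a DNS checking tool before publishing it.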

Also important:

Prevent mail to Gmail users from being blocked or sent to spam

Important: Starting November 2022, new senders who send email to personal Gmail accounts must set up either SPF or DKIM. Google performs random checks on new sender messages to personal Gmail accounts to verify they’re authenticated. Messages without at least one of these authentication methods will be rejected or marked as spam.

https://support.google.com/mail/answer/81126

Google Admin Toolbox

Includes DNS checking tools. Some of the results are specific to Google servers, but still useful to check e.g. SPF records:

https://toolbox.googleapps.com/apps/main/

24/11/2022

The Art of Command Line

https://github.com/jlevy/the-art-of-command-line

File format recommendations - I wouldn’t say they are unacceptable, but I wouldn’t recommend them either

https://www.dpconline.org/blog/file-format-recommendations

23/11/2022

How to write an alt-text image description

https://uxdesign.cc/how-to-write-an-image-description-2f30d3bf5546

WeasyPrint

Turns simple HTML pages into PDF documents, with (experimental at this stage) PDF/UA support:

https://weasyprint.org/

21/11/2022

Forensicswiki (the next generation)

https://forensics.wiki/

20/11/2022

Python on Windows

A step-by-step guide on installing Python and using the Command Prompt for Windows

https://github.com/pettarin/python-on-windows

19/11/2022

Twitter archive parser

https://github.com/timhutton/twitter-archive-parser

Does the following:

  • Converts the tweets to markdown and also HTML, with embedded images, videos and links.
  • Replaces t.co URLs with their original versions.
  • Copies used images to an output folder, to allow them to be moved to a new home.
  • Afterwards, it asks if you want to try downloading the original size images.

18/11/2022

Generate file hash for HTML script integrity attribute

From https://www.srihash.org/:

openssl dgst -sha384 -binary showdown.min.js | openssl base64 -A

Result:

TTjj1KxpUMxMChPbgmSWLlfEep0/67X86v9lnJMkldzkQGHZNAhZRgE9owovIRyz

Then prepend "sha384-" to the result. Example:

<script src="https://unpkg.com/[email protected]/dist/showdown.min.js"
integrity="sha384-TTjj1KxpUMxMChPbgmSWLlfEep0/67X86v9lnJMkldzkQGHZNAhZRgE9owovIRyz"
crossorigin="anonymous"></script>
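
The same value can also be computed without openssl. A minimal Python sketch using only the standard library (the function names are my own, not from srihash.org):

```python
import base64
import hashlib


def sri_sha384(data: bytes) -> str:
    """Return an integrity value: "sha384-" followed by the base64 digest."""
    digest = hashlib.sha384(data).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def sri_sha384_file(path: str) -> str:
    """Compute the integrity value for a file on disk."""
    with open(path, "rb") as f:
        return sri_sha384(f.read())


# Demo on in-memory bytes; for a real script use sri_sha384_file("showdown.min.js")
print(sri_sha384(b"alert('hello');"))
```

The result already includes the "sha384-" prefix, so it can be pasted straight into the integrity attribute.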

17/11/2022

Retrieve metadata for all records in Zenodo community

Adapted from here.

Example below retrieves all records from the kbnl community. By default the API only returns 10 records at a time. This can be remedied by using the 'size' parameter, set to some arbitrary value that is larger than the number of records (hits) covered by the query. Requires an access token (replace the bogus value in the example with a real one).

"""Query Zenodo records in KB community"""

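import json
import requests

# Replace the bogus value with a real access token
ACCESS_TOKEN = 'xxxxxxxxxxxxxxxxx'
# Must be larger than the number of records (hits) covered by the query
maxRecords = '500'

response = requests.get('https://zenodo.org/api/records',
                        params={'access_token': ACCESS_TOKEN,
                                'communities': 'kbnl',
                                'size': maxRecords},
                        timeout=60)
# Stop on an HTTP error instead of saving an error message as JSON
response.raise_for_status()

with open('test.json', 'w', encoding='utf-8') as f:
    json.dump(response.json(), f)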
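
To get a quick overview of what ended up in test.json, something like the sketch below could work. It assumes the records sit under `hits` -> `hits`, each with a `doi` and a `metadata` dictionary holding the `title`. That is how I understand the Zenodo response layout, but check the key names against the actual file:

```python
import json


def list_records(result: dict) -> list:
    """Return (doi, title) tuples from a Zenodo search result dictionary.

    Assumes the records sit under result["hits"]["hits"]; missing keys
    simply yield empty strings.
    """
    records = []
    for hit in result.get("hits", {}).get("hits", []):
        title = hit.get("metadata", {}).get("title", "")
        records.append((hit.get("doi", ""), title))
    return records


def list_records_from_file(path: str) -> list:
    """Load a saved JSON file (e.g. test.json) and list its records."""
    with open(path, encoding="utf-8") as f:
        return list_records(json.load(f))
```

Usage: `for doi, title in list_records_from_file("test.json"): print(doi, title)`.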

Thunderbird does not show INBOX's subfolders

https://www.reddit.com/r/Thunderbird/comments/yqcejv/thunderbird_does_not_show_inboxs_subfolders/

Workaround: in TB Server Settings -> Advanced Account Settings, uncheck "Show only subscribed folders".

How to subscribe to an IMAP folder with Thunderbird?

https://www.siteground.com/kb/how_to_subscribe_to_an_imap_folder_with_thunderbird/

Important Message For Microsoft Office 365 Enterprise Users

https://blog.thunderbird.net/2022/11/important-message-for-microsoft-office-365-enterprise-users/

In order to meet Microsoft’s requirements for publisher verification, it is necessary for us to switch to a new Azure application and application ID. However, some of these accounts are configured to require administrators to approve any applications accessing email.

03/11/2022

Digipres resources for all, for good, forever

https://www.dpconline.org/blog/wdpd/wdpd2022-jackson

20/10/2022

ImageMagick build and install

First install the needed delegate libraries (NOTE: the IM documentation doesn't make clear what the actual package names are):

sudo apt install libtiff5-dev
sudo apt install libpng-dev

OpenJPEG: build and install from source (not sure Debian version is up to date)? See also here.

Then run:

./configure

followed by:

make

Then:

sudo make install

and finally:

sudo ldconfig /usr/local/lib

Binaries in /usr/local/bin/ (note: old version still exists, uninstall!)

27/09/2022

Writing binary by hand

https://martin.hoppenheit.info/blog/2022/writing-binary-by-hand/

Literate Binary

Craft binary files from Markdown:

https://github.com/marhop/literate-binary

17/08/2022

PDF Playground

Type some valid PDF syntax on the left, and you'll see the output on the right.

https://dubroy.com/pdf-playground/

10/08/2022

Add physical USB floppy drive to VM in VirtualBox

  • In VM Settings, go to USB
  • Click "Adds new USB filter, fields set to values of selected USB device attached to the host PC"
  • Select floppy drive from list

Floppy drive is now available in guest VM after starting it up. Note: in my case it was automatically mapped to the A:\ drive.

There's NO need to set up anything in the Storage settings (Floppy controller there only works for virtual floppies!).

This also works for any other USB storage devices (e.g. thumbdrives).

When forensic write blocker (Tableau T8u USB 3.0 Bridge) is placed between host PC and floppy drive, watch out for the following:

  • Write blocker must be added as separate device (so add new USB filter as described above)
  • For some reason VM crashes if write blocker is enabled on startup. Workaround: disable (deselect) write blocker in VM settings, start up the VM, then select device from list of USB devices (USB icon at bottom of VM window)
  • In this case the device is not mapped to A: (apparently Windows doesn't see it's a floppy drive), but some other drive (e.g. E:)
  • Related to the above, the mediaType and deviceType values as described here will be less specific (RemovableMedia with write blocker vs F3_1Pt44_512 without for Media Type)

List all storage devices on Linux machine

lsblk -o KNAME,TYPE,SIZE,MODEL

Result:

KNAME TYPE   SIZE MODEL
sda   disk 931,5G TOSHIBA_DT01ACA100
sda1  part  55,9G 
sda2  part   7,5G 
::
::
sdd   disk   1,4M USB-FDU

03/08/2022

High throughput JPEG 2000 (HTJ2K)

Links to open-source encoder, white papers and web-based demos:

https://github.com/aous72/OpenJPH

Looks like Kakadu can already write the JPH format (under kdu_compress advanced Part-15 (HTJ2K) Features):

https://kakadusoftware.com/wp-content/uploads/Usage_Examples.txt

Recording of IIIF community call on HTJ2K:

https://youtu.be/nzkn0W2esOQ

01/08/2022

Free vector-based images of legacy media formats

By Ashley Blewer, all CC-BY licensed:

https://github.com/ablwr/illustrations

19/07/2022

Disable CPU frequency scaling (Renoise)

During installation of Renoise this message is shown:

Checking CPU frequency scaling... Your CPU frequency governor is NOT set to 'performance'. It's HIGHLY RECOMMENDED to disable CPU frequency scaling for realtime audio applications.

With link to:

https://wiki.linuxaudio.org/wiki/system_configuration#cpu_frequency_scaling

Following instructions there, checked current settings:

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Result:

powersave
powersave
powersave
powersave

So changed using:

echo -n performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

TODO: set up a service to do this at startup as explained on linuxaudio wiki.
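
For checking the governors from a script rather than with cat, a small Python sketch (reads the same sysfs files as the cat command above; `read_governors` is my own helper name):

```python
import glob
import os


def read_governors(base="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq settings."""
    governors = {}
    pattern = os.path.join(base, "cpu*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # Path ends in .../cpuN/cpufreq/scaling_governor, so cpuN is third from the end
        cpu = path.split(os.sep)[-3]
        with open(path, encoding="ascii") as f:
            governors[cpu] = f.read().strip()
    return governors


# Prints nothing on systems without cpufreq support
for cpu, governor in read_governors().items():
    print(cpu, governor)
```

This only reads the settings; changing them still needs root, as with the tee command.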

14/07/2022

List MIDI devices in Linux

aplaymidi -l

Result:

 Port    Client name                      Port name
 14:0    Midi Through                     Midi Through Port-0
 16:0    Scarlett 2i4 USB                 Scarlett 2i4 USB MIDI 1
 20:0    cubit duo                        cubit duo MIDI 1
 28:0    Arturia KeyStep 37               Arturia KeyStep 37 MIDI 1

08/07/2022

Minimal Long-Term Storage Economic Model

https://economicmodel.dshr.org/

Blog:

https://blog.dshr.org/2022/07/economic-model-revived.html

22/06/2022

Archived PDFs from Adobe Acrobat Engineering site

https://web.archive.org/web/*/http://acroeng.adobe.com/Test_Files/*

Durability of recordable DVD±R and DVD made of glass (Syylex) at elevated temperature and humidity

Especially interesting results for M-DISC:

The DVD + R with inorganic recording layer such as M-DISC and DataTresorDisc show no longer lifetimes than conventional DVD±R.

https://www.lne.fr/sites/default/files/inline-files/syylex-glass-dvd-accelerated-aging-report.pdf

16/06/2022

Make screen capture of Windows window with ffmpeg

ffmpeg -f gdigrab -framerate 30 -i title="Lotus ScreenCam Playback View" -f m4v gentopac.mp4

Here title= is followed by the name of the window to capture. Doesn't work on Linux (gdigrab is a Windows-only input device)!

09/06/2022

Storage media type detection using the Windows API and Python

https://gist.github.com/bitsgalore/cd30cb8c20c856651b4b858b5f4ee7b0

31/05/2022

Display BIOS-embedded Windows activation key under Linux

sudo strings /sys/firmware/acpi/tables/MSDM

(source)

21/05/2022

Lotus 1-2-3 For Linux

https://lock.cmpxchg8b.com/linux123.html

Github:

https://github.com/taviso/123elf

18/05/2022

CPU temperature check

Package lm-sensors. Command:

sensors

Crypto-preservation and the ghost of Andy Warhol

By Jon Ippolito:

https://jonippolito.net/writing/ippolito_warhol_nfts_preprint_for_mdpi_2022.html

12/05/2022

Zenodo_get

Download all files from Zenodo record:

https://github.com/dvolgyes/zenodo_get

Example:

zenodo_get https://zenodo.org/record/2556637

Results in 1 PDF document and 1 checksum file.

30/04/2022

PyScript

Run Python in Your HTML:

https://pyscript.net/

20/04/2022

How to Rename the master branch to main in Git

https://www.git-tower.com/learn/git/faq/git-rename-master-to-main

13/04/2022

fq

Tool, language and decoders for working with binary data:

https://github.com/wader/fq

12/04/2022

Install Python under Wine

Wine (5.0) refuses to install recent Python installers. Alternative options:

  • Unzip package (regular ZIP file, despite extension)
  • Copy contents of "tools" directory to ~/.wine/drive_c

2. Use embeddable package

https://www.python.org/downloads/release/python-3104/

But this requires some really tedious post-install configuration:

https://gist.github.com/jtmoon79/ce63fe655b2f544462e70d8e5ec30ff5

Using option 2, I eventually got stuck trying to install PyInstaller, which failed with a pip error. No idea why.

Cross-platform CLI file input with wildcards in Python

test-cli-wildcard.py
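
The linked script isn't reproduced here. As a rough sketch of the general idea (my own minimal version, not necessarily what the script does): Unix shells expand wildcards before Python sees them, the Windows Command Prompt doesn't, so expand them with glob on both:

```python
import glob
import sys


def expand_args(args):
    """Expand wildcard patterns in a list of command-line arguments.

    Arguments that match nothing are kept as-is, so plain file names
    (existing or not) pass through unchanged.
    """
    files = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        files.extend(matches if matches else [arg])
    return files


if __name__ == "__main__":
    for name in expand_args(sys.argv[1:]):
        print(name)
```

On Linux the shell has usually expanded the pattern already, in which case glob just returns each existing file name unchanged.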

08/04/2022

KB data services and APIs documentation

https://web.archive.org/web/20220204153243/https://www.kb.nl/en/resources-research-guides/data-services-apis

Technische uitleg dataset EDBO

https://web.archive.org/web/20220408132300/https://webcache.googleusercontent.com/search?q=cache%3Ahttps%3A%2F%2Fwww.kb.nl%2Fsites%2Fdefault%2Ffiles%2Fdocs%2Ftechniek-edbo_0.pdf

10/03/2022

Grok build and install

https://wiki.harvard.edu/confluence/display/DigitalImaging/Installing+OpenJPEG+on+Windows+10%2C+Linux%2C+and+MacOS

09/03/2022

CollectionBuilder

CollectionBuilder is an open source tool for creating digital collection and exhibit websites that are driven by metadata and powered by modern static web technology.

https://collectionbuilder.github.io/

Jekyll-based templates for building digital collections and exhibits exploring static web solutions for libraries:

https://github.com/CollectionBuilder

08/03/2022

Bypass Twitter login popup

  • Open uBlock Origin Dashboard
  • Click "My Filters" tab
  • Paste in following code:
    twitter.com##div#layers div[data-testid="sheetDialog"]:upward(div[role="group"][tabindex="0"])
    twitter.com##html:style(overflow: auto !important;)
    
  • Click "Apply Changes"

Source: here.

04/03/2022

Repairing damaged CDs

https://wiki.slimdevices.com/index.php/Repairing_damaged_CDs.html

Media Formats pages on Logitech SqueezeBox Wiki

https://wiki.slimdevices.com/index.php/Category_Media_formats.html

04/02/2022

A LaTeX to XML/HTML/MathML Converter

https://math.nist.gov/~BMiller/LaTeXML/

Articles from arXiv.org as responsive HTML5 web pages.

https://ar5iv.org/

JHOVE PDF validation

https://files.dnb.de/nestor/weitere/ipres2017.pdf

and:

https://openpreservation.org/blogs/pdf-validation-with-exiftool-quick-and-not-so-dirty/

and (related):

https://wiki.opf-labs.org/display/Documents/JHOVE+issues+and+error+messages

But it seems the detail links to the specific JHOVE errors are dead (all point to BL fork of JHOVE Github repo).

27/01/2022

Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction

Article by Tim Allison:

https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf

24/01/2022

Count all instances of xml element (works for arbitrary namespaces)

Count all "file" elements in "conf-all.xml":

xmllint --xpath "count(//*[local-name()='file'])" conf-all.xml
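
Same count in Python, for when xmllint isn't around. Minimal sketch with the standard library's ElementTree; the sample XML and the `urn:example` namespace are made up:

```python
import xml.etree.ElementTree as ET


def count_elements(xml_text: str, local_name: str) -> int:
    """Count elements with the given local name, whatever their namespace."""
    root = ET.fromstring(xml_text)
    count = 0
    for elem in root.iter():
        # Namespaced tags look like "{namespace-uri}name"; keep the part after "}"
        if isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == local_name:
            count += 1
    return count


sample = '<a xmlns:x="urn:example"><x:file/><file/><other/></a>'
print(count_elements(sample, "file"))  # prints 2
```

For a file on disk, read it first, e.g. `count_elements(open("conf-all.xml", encoding="utf-8").read(), "file")`.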

23/01/2022

Parsr

Parsr, is a minimal-footprint document (image, pdf, docx, eml) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.

https://github.com/axa-group/Parsr

02/01/2022

Linux configuration for Reaper

This fixes pops/crackle problems while recording audio. Main steps here:

https://forum.cockos.com/showthread.php?t=210390

First install low-latency kernel:

sudo apt install linux-lowlatency

Then edit /etc/security/limits.conf and add following entries:

@audio           -       rtprio           98
@audio           -       memlock         unlimited

Add user to group audio:

sudo usermod -a -G audio myusername

Possibly relevant: clearlinux/distribution#2372

22/12/2021

Obsidian

Obsidian is a powerful knowledge base on top of a local folder of plain text Markdown files.

https://obsidian.md/

17/12/2021

CD+G (CD+Graphics)

CD+G (CD+Graphics) is an extension of the Compact Disc format that can present low-resolution graphics on a television alongside the audio data on the disc when played on a compatible device.

https://obsoletemedia.org/cdg/

09/12/2021

Preserving Immersive Media Knowledge Base

The Preserving Immersive Media Knowledge Base is a resource created to help share information between members of the digital preservation community who are caring for virtual reality (VR), augmented reality (AR), mixed reality (MR), 360 video, real-time 3D software and other similar materials.

https://pimkb.gitbook.io/preserving-immersive-media-knowledge-base/

Hamburgers and Cows; The Cognitive Style of PDF

https://blogs.ch.cam.ac.uk/pmr/2006/09/10/hamburgers-and-cows-the-cognitive-style-of-pdf/

08/12/2021

Grappling with the Scale of Born-Digital Government Publications: Toward Pipelines for Processing and Searching Millions of PDFs

https://arxiv.org/abs/2112.02471

27/11/2021

Bitrot resistance of next-generation image formats

https://www.ctrl.blog/entry/bitrot-avif-jxl-comparison.html

RGBA Structural Similarity

This tool computes (dis)similarity between two PNG images using (my approximation of) algorithms approximating human vision.

https://kornel.ski/dssim

10/11/2021

Encyclopedia of graphics file formats

https://archive.org/details/mac_Graphics_File_Formats_Second_Edition_1996/mode/2up

09/11/2021

Pure 3D Technical Report

This PURE3D Technical Report is meant to provide a high-level state of the art summary on 3D scholarly web infrastructures.

https://pure3d.eu/wp-content/uploads/2021/09/Pure3D_Technical-Report.pdf

07/11/2021

Programming Machine Learning - From Coding to Deep Learning

https://pragprog.com/titles/pplearn/programming-machine-learning/

04/11/2021

What are the Barriers to Teaching Digital Forensics?

https://educopia.org/what-are-the-barriers-to-teaching-digital-forensics/

22/10/2021

Textract

Multi-format text extraction in Python:

https://textract.readthedocs.io/en/stable/

Aaru Data Preservation Suite

With Aaru you can identify a media dump, extract files from it (for supported filesystems), compare two of them, create them from real media using the appropriate drive, create a sidecar metadata with information about the media dump, and a lot of other features that commonly would require you to use separate applications.

https://github.com/aaru-dps/Aaru

19/10/2021

Copying files isn't always a straightforward process

https://blog.suppliedtitle.org/2021/10/19/copying-files-isnt-always-a-straightforward-process-or-some-things-ive-learned-working-with-digital-archives.html

08/10/2021

Twitter mute keywords

Here are some terms to mute on Twitter to clean your timeline up a bit.

https://gist.github.com/IanColdwater/88b3341a7c4c0cf71c73ac56f9bd36ec

29/09/2021

Making Sense of PDF Structures in the Wild at Scale

Slides, presentation Tim Allison, PDF Days:

https://zenodo.org/record/5539013

28/09/2021

The Arlington PDF Model

https://github.com/pdf-association/arlington-pdf-model

Software-preservation's Zotero Library

https://www.zotero.org/software-preservation/items/UKUFEWPD/library

21/09/2021

Wotsit.org

the programmer's file and data format resource

https://web.archive.org/web/20140103020659/http://www.wotsit.org/

BUT links don't work because the site was blocking crawlers!

Mirror in IA:

https://archive.org/details/2018_10_23__www.wotsit.org

File formats - wotsit.org alternative by PeatSoft

https://hwiegman.home.xs4all.nl/file-formats.html

07/09/2021

pdftk and php-pdftk on Ubuntu 18.04 without using snap

https://www.joho.se/2020/10/01/pdftk-and-php-pdftk-on-ubuntu-18-04-without-using-snap/

06/09/2021

Overview of the 'tika-eval' Module

https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=109454084#content/view/109454084

05/09/2021

How to Change Your Browser's User Agent and Trick Websites

https://www.makeuseof.com/tag/trick-websites-changing-user-agent-chrome/

24/07/2021

Symbol salad

Tap a button below, and that symbol will be copied into your clipboard for you to paste where needed.

https://symbolsalad.com/

14/07/2021

Multiple search & replace on many files

#!/bin/bash

# Schemas dir
schemasDir=/home/johan/kb/jprofile/jprofile/schemas

nsOld=http://openpreservation.org/ns/jpylyzer/
nsNew=http://openpreservation.org/ns/jpylyzer/v2/
rootEltOld=j:jpylyzer
rootEltNew=j:file
validOld=isValidJP2
validNew=isValid
elTextOld="no jpylyzer element found"
elTextNew="no file element found"

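# Loop over all Schematron files; null-delimited, and with quoted
# variables, so file names with spaces don't break the sed calls
while IFS= read -d $'\0' -r file ; do
    sed -i "s|$nsOld|$nsNew|g" "$file"
    sed -i "s|$rootEltOld|$rootEltNew|g" "$file"
    sed -i "s|$validOld|$validNew|g" "$file"
    sed -i "s|$elTextOld|$elTextNew|g" "$file"

done < <(find "$schemasDir" -type f -name '*.sch' -print0)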
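
A Python take on the same job, as a sketch. It assumes the files are UTF-8, and it does literal string replacement (sed treats the search strings as regular expressions, which makes no practical difference for these particular values):

```python
from pathlib import Path

# Same old -> new pairs as in the Bash script, applied in the same order
REPLACEMENTS = [
    ("http://openpreservation.org/ns/jpylyzer/",
     "http://openpreservation.org/ns/jpylyzer/v2/"),
    ("j:jpylyzer", "j:file"),
    ("isValidJP2", "isValid"),
    ("no jpylyzer element found", "no file element found"),
]


def replace_in_files(directory, pattern="*.sch", replacements=REPLACEMENTS):
    """Apply all replacements to every matching file below directory.

    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(directory).rglob(pattern)):
        if not path.is_file():
            continue
        original = path.read_text(encoding="utf-8")
        updated = original
        for old, new in replacements:
            updated = updated.replace(old, new)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Like the sed version this is not safe to run twice: the old namespace is a prefix of the new one, so a second run would turn "v2/" into "v2/v2/".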

02/07/2021

PDFMiner

PDFMiner is a text extraction tool for PDF documents

https://pypi.org/project/pdfminer/

Includes dumppdf.py, which "is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format."

11/06/2021

QEMU QED

https://eaasi.gitlab.io/program_docs/qemu-qed/

28/05/2021

Tips for Using the Internet Archive’s Wayback Machine in Your Next Investigation

https://gijn.org/2021/05/05/tips-for-using-the-internet-archives-wayback-machine-in-your-next-investigation/

Save Page Now 2 Public API Docs Draft

https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit

25/05/2021

Create Nyquist plugins for Audacity

https://manual.audacityteam.org/man/creating_nyquist_plug_ins.html

Store here:

/home/johan/.audacity-files/plug-ins

24/05/2021

Install Wine on Linux Mint 20.1

apt install wine-installer

17/05/2021

Assessing risk in Office documents

Are graphene-coated face masks a COVID-19 miracle – or another health risk?

https://theconversation.com/are-graphene-coated-face-masks-a-covid-19-miracle-or-another-health-risk-159422

30/04/2021

PDF Specification Index

https://www.pdfa.org/resource/pdf-specification-index/

22/04/2021

WaybackPy

https://pypi.org/project/waybackpy/

Save list of URLs to Internet Archive's Wayback Machine:

https://gist.github.com/bitsgalore/46ac9279a2e18f784feb7372cf280b39

Subdomain finder

https://subdomainfinder.c99.nl

Reverse Whois Lookup

https://www.reversewhois.io/

15/04/2021

Extract object data from PDF

PDF with embedded Shockwave Flash data. After poking around the file in a Hex editor I found this object, which appears to hold some Flash data (search pattern: Subtype entry with value application#2Fx-shockwave-flash):

151 0 obj<</Length 18555
    /Subtype/application#2Fx-shockwave-flash
    /Params
    <<
      /Size 18555
      /CheckSum<acb03efbfee3ef1229f055ced91fc1aa>>>
      /DL 18555
    >>
    stream
    ....stream data with Shockwave Flash content ...
    endstream
endobj

To extract the data stream (everything between stream and endstream), use MuPDF's mutool:

mutool show -b Disney-Flash.pdf 151 > disney.swf

Resulting file is identified as Shockwave flash by Unix File (but oddly not by Siegfried).

13/04/2021

Trim video

Only keep frames between t=40 s and t=60 s:

ffmpeg -i Windows-3.webm -ss 40 -to 60 output.webm

07/04/2021

BnF File Formats Wiki

https://github.com/hackathonBnF/FichesFormat/wiki

06/04/2021

Check PDF document for broken links with pdfcpu

pdfcpu validate -l PDFInventoryPreservationRisks_0_2.pdf

Result:

validating(mode=relaxed) PDFInventoryPreservationRisks_0_2.pdf ...
validating URIs..
............
Page 55: http://www.jpeg.org/jpeg2000/CDs15444.html status=404
Page 55: http://www.f-secure.com/vulnerabilities/SA30832 status=404
Page 55: http://www.planetpdf.com/mainpage.asp?WebPageID=362 status=404
validation error: broken links detected

01/04/2021

Web Archive Embeds Starter

This project demonstrates archiving embeds with ReplayWeb.page.

https://glitch.com/edit/#!/web-archive-embeds-starter?path=README.md%3A1%3A0

31/03/2021

MD-Project

Workflows for transferring contents from MiniDisc using open-source tools:

https://github.com/jyw321/MD-Project

A Floppy Controller For The Raspberry Pi

https://hackaday.com/2021/03/30/a-floppy-controller-for-the-raspberry-pi/

13/03/2021

FROM MY TO ME

In my opinion, people struggling to position a dripping blood animation in between two skulls and under ENTER IF YOU DARE, and pick up an appropriate MIDI tune to sync with the blood drip, made an important contribution to showing the beauty and limitation of web browsers and HTML code.

https://interfacecritique.net/book/olia-lialina-from-my-to-me

11/03/2021

Adobe Acrobat 1.0 serial number

https://web.archive.org/web/20180412015446/https://www.findserialnumber.net/acrobat-reader-for-dos-1-0-serial-number-keygen-a4d9f5b8.html#

09/03/2021

Internet Archive Scholar

https://scholar.archive.org/

Blog:

https://blog.archive.org/2021/03/09/search-scholarly-materials-preserved-in-the-internet-archive/

04/03/2021

Create Siegfried signatures from custom DROID sigs

  1. Copy DROID sigs (regular + container) to Siegfried homedir:
sudo cp *.xml /usr/share/siegfried/
  2. Run roy build with -noreports, -droid and -container flags:
sudo roy build -noreports \
     -droid /usr/share/siegfried/ipa-standard-signature-file-v1-03-03-21.xml \
     -container /usr/share/siegfried/ipa-CHLdev1-signaturefile-20210303.xml

To revert to original signatures, run:

sudo roy build

What’s Up, (with Google) Docs? – The Challenge of Native Cloud Formats

https://www.dpconline.org/blog/whats-up-with-google-docs

02/03/2021

Malicious / problematic PDF corpora

25/02/2021

Detect characters that are not mapped to Unicode in a PDF

https://issues.apache.org/jira/browse/TIKA-3305

Processing Dangerous Paths – On Security and Privacy of the Portable Document Format

https://www.ndss-symposium.org/wp-content/uploads/ndss2021_1B-2_23109_paper.pdf

24/02/2021

FritzBox DSL connection drops frequently

https://en.avm.de/service/fritzbox/fritzbox-7590/knowledge-base/publication/show/41_DSL-connection-drops-frequently/

18/02/2021

Run macOS on QEMU/KVM

This README.md documents the process of creating a Virtual Hackintosh system.

https://github.com/kholia/OSX-KVM/

14/02/2021

Emulation resources

List by Ethan Gates:

https://github.com/EG-tech/emulation-resources

28/01/2021

Testing Android apps on a virtual machine

https://www.sjoerdlangkemper.nl/2020/05/06/testing-android-apps-on-a-virtual-machine/

26/01/2021

If I build a package from source how can I uninstall or remove completely?

https://askubuntu.com/questions/87111/if-i-build-a-package-from-source-how-can-i-uninstall-or-remove-completely

Esp:

In the future to avoid that kind of problems try to use checkinstall instead of make install whenever possible (AFAIK always unless you want to keep both the compiled and a packaged version at the same time). It will create and install a deb file that you can then uninstall using your favorite package manager.

10/01/2021

Open Scientist Handbook

https://openscientist.pubpub.org/pub/play/release/1

08/01/2021

Pdf-issues

This public repository is hosted by the PDF Association in order to provide developers with a means of openly reporting issues against the latest core PDF 2.0 specification (ISO 32000-2:2020) for review and resolution by industry and ISO experts.

https://github.com/pdf-association/pdf-issues

05/01/2021

Python command-line bootstrap

This is a structure template for Python command line applications, ready to be released and distributed via setuptools/PyPI/pip for Python 2 and 3.

https://github.com/jgehrcke/python-cmdline-bootstrap

12/12/2020

Dead Simple Python

https://dev.to/codemouse92/introducing-dead-simple-python-563o

28/11/2020

Microkorg editor troubleshooting

If loading sounds from .prg files gives unexpected results: check that the MIDI channel is set to 1 before launching the editor! Reportedly MIDI clock needs to be set to external as well.

14/11/2020

A Visual Guide to Regular Expression

In this post, I will illustrate the various concepts underlying regex. The goal is to help you build a good mental model of how a regex pattern works.

https://amitness.com/regex/

11/11/2020

List of applications - ArchWiki

[A] general list of applications sorted by category, as a reference for those looking for packages. Many sections are split between console and graphical applications.

https://wiki.archlinux.org/index.php/List_of_applications

03/11/2020

How to move /var/www/html folder to external hdd?

https://superuser.com/questions/1101851/how-to-move-var-www-html-folder-to-external-hdd/1101856

Also:

https://askubuntu.com/questions/1220778/how-can-web-server-access-external-hdd

29/10/2020

Thorium Reader

Thorium Reader is an easy to use EPUB reading application for Windows 10/10S, MacOS and Linux.

https://github.com/edrlab/thorium-reader/releases

21/10/2020

Apache: redirect all folder root references to home.htm file in folder

This seems to work:

RedirectMatch ^(.*)/$ $1/home.htm
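
To see which paths the rule touches, the pattern can be tried out in Python. This is only an approximation (Apache uses PCRE, and writes the backreference as $1 rather than \1), but for a pattern this simple the behaviour should be the same:

```python
import re

# Same pattern as in the RedirectMatch directive: any path ending in "/"
PATTERN = re.compile(r"^(.*)/$")


def redirect_target(url_path: str) -> str:
    """Return the redirect target for a URL path, or the path itself if no match."""
    return PATTERN.sub(r"\1/home.htm", url_path)


for path in ("/", "/docs/", "/docs/page.htm"):
    print(path, "->", redirect_target(path))
```

Only paths that end in a slash are rewritten; "/docs/page.htm" comes back unchanged.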

16/10/2020

MIDI not working under Jack / Reaper

In View menu, open routing matrix and click on system:midi midi playback2 (needs to be enabled first from Preferences). Routing is set for each track.

15/10/2020

Virtualbox fails after kernel update

https://askubuntu.com/questions/819939/virtualbox-fails-after-kernel-update

06/10/2020

The Quartz guide to bad data

An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

https://github.com/Quartz/bad-data-guide

14/09/2020

What's so hard about PDF text extraction?

https://filingdb.com/b/pdf-text-extraction

11/09/2020

More than 100 scientific journals have disappeared from the Internet

https://www.nature.com/articles/d41586-020-02610-z

07/09/2020

ftfy: fixes text for you

ftfy fixes Unicode that's broken in various ways.

https://github.com/LuminosoInsight/python-ftfy

03/09/2020

QGIS Flatpak instructions

https://www.qgis.org/en/site/forusers/alldownloads.html#flatpak

01/09/2020

Keep Remote SSH Sessions and Processes Running After Disconnection

https://www.tecmint.com/keep-remote-ssh-sessions-running-after-disconnection/

Steps:

screen

Then issue commands. Then press Ctrl-a followed by d to detach. Log out.

31/08/2020

Linux - display details on startup/boot

systemd-analyze time

Result (in this case there's some odd firmware delay):

Startup finished in 1min 55.160s (firmware) + 10.965s (loader) + 3.955s (kernel) + 10.002s (userspace) = 2min 20.085s
graphical.target reached after 9.996s in userspace

Detailed breakdown:

systemd-analyze blame

Result:

          7.416s NetworkManager-wait-online.service
          1.966s vboxdrv.service
           827ms apt-daily-upgrade.service
           558ms systemd-fsck@dev-disk-by\x2duuid-9224\x2d4AC1.service
           500ms dev-sdb1.device
           477ms systemd-journal-flush.service
           ::   ::
           

Split large text file into smaller files

Here, split into 500,000-line files:

split -l 500000 -d 2019-05-21_all_domains_NL.txt domains-nl
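
For machines without split (e.g. Windows), a rough Python equivalent. The output names follow the prefix00, prefix01 pattern seen with -d; I haven't checked how closely the numbering matches split once more than 100 files are written:

```python
def split_lines(path, prefix, lines_per_file=500000):
    """Split a text file into numbered chunks of at most lines_per_file lines.

    Output files are named prefix00, prefix01, and so on.
    Returns the list of file names written.
    """
    written = []
    out = None
    # Binary mode keeps line endings and encoding exactly as they are
    with open(path, "rb") as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                if out:
                    out.close()
                name = "%s%02d" % (prefix, len(written))
                out = open(name, "wb")
                written.append(name)
            out.write(line)
    if out:
        out.close()
    return written
```

Usage: `split_lines("2019-05-21_all_domains_NL.txt", "domains-nl")`.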

30/08/2020

How to delete all your files

https://www.reddit.com/r/linux/comments/if1krd/how_to_delete_all_your_files/

26/08/2020

PDF info/validation/testing commands

qpdf --check --verbose whatever.pdf
pdfinfo whatever.pdf

Or (forces reading of all text):

pdftotext whatever.pdf
jhove -m PDF-hul -i whatever.pdf
gs -dNOPAUSE -dBATCH -sDEVICE=nullpage whatever.pdf

Using PDFDebugger (activates GUI-type browser):

java -jar ~/pdfbox/pdfbox-app-2.0.21.jar PDFDebugger whatever.pdf
mutool info whatever.pdf 
verapdf whatever.pdf

(Or use GUI).

pdfcpu validate whatever.pdf

Note to self: installed this by copying the Linux binary to ~/.local/bin/ (doesn't require GoLang).

Compare two PDFs

Compare text (verbose output):

comparepdf ct -v=2 whatever.pdf wherever.pdf

Compare appearance (verbose output):

comparepdf ca -v=2 whatever.pdf wherever.pdf

12/08/2020

Reaper reports "JACK: error creating client" error on startup

First run jackd:

jackd -dalsa -dhw:USB -r48000 -p128 -n3 -Xseq

See also here

02/08/2020

Convert stereo audio file to mono, changing bit depth and sampling frequency

To 8-bit, 15 kHz:

sox versatility.wav -b 8 -r 15k versatility_8.wav remix -

BUT sox output is really noisy; better results with ffmpeg:

ffmpeg -i boc-arpeggio.wav -ar 15000 -acodec pcm_u8 boc-arpeggio-8ff.wav

27/07/2020

How to hide a list in HTML without javascript

https://stackoverflow.com/a/13127738/1209004

14/07/2020

Create shared folder on local network (Linux Mint, Caja file manager)

From instructions here:

Install samba and caja-share

sudo apt install samba
sudo apt install caja-share

Set up usershares folder and make sambashare group owner

sudo mkdir /var/lib/samba/usershares
sudo chgrp sambashare /var/lib/samba/usershares
sudo chmod 1770 /var/lib/samba/usershares

Set samba password

sudo smbpasswd -a your_username

Then reboot machine, and right-click folder in Caja and select sharing options. After this, folder is accessible from other machines on the local network.

06/07/2020

Python - CGI Programming

https://www.tutorialspoint.com/python/python_cgi_programming.htm

30/06/2020

U.S. National Archives and Records Administration Digital Preservation Framework

150 formats added in latest release:

https://github.com/usnationalarchives/digital-preservation

Convert Kazam output to HTML5-compatible MP4

ffmpeg -i mirror.mp4 -vcodec libx264 -pix_fmt yuv420p -profile:v baseline -level 3 -strict -2 mirror-264.mp4

(Source)

24/06/2020

HTML video elements in local Jekyll site not working in Chrome

https://stackoverflow.com/questions/48876911/embedded-local-mp4-not-playing-in-chrome-when-running-jekyll-serve-econnreset

Apparently works when deployed live:

https://exoji2e.github.io/2019/02/18/video-tag-in-chrome.html

18/06/2020

Enable CGI Scripts on Apache

https://www.ionos.com/community/server-cloud-infrastructure/apache/enable-cgi-scripts-on-apache/

But this assumes 1 fixed dir for cgi scripts.

Apache Tutorial: Dynamic Content with CGI

https://httpd.apache.org/docs/2.4/howto/cgi.html

This explains how to set custom script locations.

17/06/2020

File naming conventions based on Semantic tagging

https://karl-voit.at/managing-digital-photographs/

Tools here:

https://github.com/novoid

19/05/2020

Prospect Mail

The Outlook desktop client for the new Outlook Interface from MS Office 365.

https://github.com/julian-alarcon/prospect-mail

18/05/2020

Test images Developer's Image Library

https://sourceforge.net/p/openil/svn/1554/tree/trunk/Test%20Images/

JPEG 2000 (Bandcamp)

https://jpeg2000.bandcamp.com

Bitrot tool

Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay.

https://github.com/ambv/bitrot

16/05/2020

Python for AV

https://lis655.github.io/av-python-carpentry/

13/05/2020

JPEG White Paper: JPEG XL image coding system

http://ds.jpeg.org/whitepapers/jpeg-xl-whitepaper.pdf

Setting up Python-based web server

Just run:

python3 -m http.server

Then site can be accessed from:

http://127.0.0.1:8000/

Useful for testing with local files, not suitable for production. More info:

https://developer.mozilla.org/en-US/docs/Learn/Common_questions/set_up_a_local_testing_server
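
The same server can also be started from a script, which makes it easy to serve a specific folder and to bind to the loopback address only (the command-line version listens on all interfaces by default). A minimal sketch; the function name make_server is made up for this note:

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory=".", port=8000):
    """Return a server for `directory`, bound to the loopback interface only."""
    # The `directory` argument of the handler requires Python 3.7 or later
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server().serve_forever() then serves the current directory at http://127.0.0.1:8000/ until interrupted. Like the one-liner, this is for local testing only.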

08/05/2020

Two Bit Bash Script Library

https://twobitpreservation.com/script-library

07/05/2020

Download videos from YouTube (and more sites)

https://ytdl-org.github.io/youtube-dl/index.html

03/05/2020

We read the privacy policies of Skype, Meet, and Webex: 10 ways videoconferencing systems can better protect privacy for customers

https://medium.com/cr-digital-lab/skype-meet-webex-videoconference-privacy-845bc8360fd3

02/05/2020

Digital Repair Cafe (Project CEST)

In terms of goals and scope this closely resembles the NDE project on physical carriers:

https://automatic-ingest-digital-archives.github.io/Digital-Repair-Cafe/

See also, for example, "Handleiding Verouderde Dragers Herkennen" (guide to recognising obsolete carriers):

https://www.projectcest.be/wiki/Publicatie:Handleiding_Verouderde_Dragers_Herkennen

How to Read a Floppy Disk on a Modern PC or Mac

https://www.howtogeek.com/669331/how-to-read-a-floppy-disk-on-a-modern-pc-or-mac/

30/04/2020

Reduce PDF file size

Using Ghostscript:

https://askubuntu.com/a/256449/1052776

23/04/2020

Choosing the right video conferencing tool for the job

https://freedom.press/training/blog/videoconferencing-tools/

COVID-19 and Cybersecurity

https://medium.com/@gdbelvin/covid-19-and-cybersecurity-e9ee5cba6de7

SPARQL queries YUL digital preservation

https://www.wikidata.org/wiki/User:YULdigitalpreservation/SPARQL2#Disk_image_file_formats

17/04/2020

Preservica adds headers/footers to exported HTML files

wellcomecollection/platform#4425

16/04/2020

The Robustness of Apache Tika

https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika

09/04/2020

How to use Jitsi Meet, an open source Zoom alternative

https://mashable.com/article/how-to-use-jitsi-meet-zoom-alternative/

05/04/2020

Malware Analysis Fundamentals - Files | Tools

https://winitor.com/pdf/Malware-Analysis-Fundamentals-Files-Tools.pdf

02/04/2020

The best alternatives to Zoom for videoconferencing

https://www.theverge.com/2020/4/1/21202945/zoom-alternative-conference-video-free-app-skype-slack-hangouts-jitsi

01/04/2020

Github Wikis

https://help.github.com/en/github/building-a-strong-community/about-wikis

And:

https://help.github.com/en/github/building-a-strong-community/adding-or-editing-wiki-pages

Simple-Jekyll-Search

A JavaScript library to add search functionality to any Jekyll blog:

https://github.com/christian-fei/Simple-Jekyll-Search

27/03/2020

Jitsi installation instructions

https://jitsi.org/downloads/ubuntu-debian-installations-instructions/

Jitsi servers NL

https://vc4all.nl/

26/03/2020

Books.Files: Preservation of Digital Assets in the Contemporary Publishing Industry

https://drum.lib.umd.edu/handle/1903/25605

22/03/2020

Digital preservation policies and strategies (Caylin Smith)

https://docs.google.com/spreadsheets/d/1nAPh6M5c2VlvuFtdMIDEfxwdLvQ-47-i0ZicUUGkzjM/edit#gid=0

21/03/2020

Disable / enable webcam from terminal

Disable until reboot:

sudo modprobe -r uvcvideo

Enable again:

sudo modprobe uvcvideo

Source

18/03/2020

Create large test file with only null bytes

For a 1 MB file:

dd if=/dev/zero of=file.dat count=1024 bs=1024

Same, 1 GB file:

dd if=/dev/zero of=file.dat count=1024 bs=1048576

Source
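
Where dd is not available (e.g. on Windows), something similar can be done in Python. A rough sketch, writing the zeros in chunks so that large files don't have to fit in memory; the function name write_zero_file is my own:

```python
import os


def write_zero_file(path, size_bytes, chunk_size=1024 * 1024):
    """Write `size_bytes` null bytes to `path`, one chunk at a time."""
    chunk = bytes(chunk_size)  # bytes(n) yields n zero bytes
    remaining = size_bytes
    with open(path, "wb") as handle:
        while remaining > 0:
            n = min(chunk_size, remaining)
            handle.write(chunk[:n])
            remaining -= n
    # Report the resulting file size in bytes
    return os.path.getsize(path)
```

For example, write_zero_file("file.dat", 1024 * 1024) corresponds to the 1 MB dd command above.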

17/03/2020

Washing machine indicates overdosing

https://www.wasmachines.nl/forum/457-miele-w2203-lampje-overdosering/

https://community.consumentenbond.nl/woning-huishouden-8/miele-wasmachine-trommelkruis-designed-to-fail-16834

But:

https://www.klusidee.nl/Forum/miele-w-3821-wasmachine-meldt-contr-dosering-t46008.html

So: run a wash at 95 degrees, otherwise use a special cleaning agent.

16/03/2020

WordToEPUB

https://daisy.org/activities/software/wordtoepub/

Announcement:

https://daisy.org/news-events/articles/new-epub-creation-tool/

12/03/2020

OneDrive – Some files weren’t downloaded

https://web.archive.org/web/20190704152920/http://yannickborghmans.com/2018/05/19/onedrive-some-files-werent-downloaded/

11/03/2020

Download files and folders from OneDrive or SharePoint

https://support.office.com/en-us/article/download-files-and-folders-from-onedrive-or-sharepoint-5c7397b7-19c7-4893-84fe-d02e8fa5df05

Downloads are subject to the following limits: individual file size limit: 10GB; total zip file size limit: 20GB; total number of files limit: 10,000.

10/03/2020

Unzipping 6 GB OneDrive ZIP file under Linux fails

Reworked this into a blog:

https://www.bitsgalore.org/2020/03/11/does-microsoft-onedrive-export-large-ZIP-files-that-are-corrupt

04/03/2020

Map Windows Folder to a Drive Letter for Quick and Easy Access

https://www.raymond.cc/blog/map-folder-or-directory-to-drive-letter-for-quick-and-easy-access/

03/03/2020

What's so hard about PDF text extraction?

https://www.filingdb.com/pdf-text-extraction

02/03/2020

Graphviz

Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks.

https://www.graphviz.org/

01/03/2020

Bot Sentinel

Bot Sentinel is a free platform developed to detect and track trollbots and untrustworthy Twitter accounts.

https://botsentinel.com/

The Importance of Digital Persistence

https://philarcher.org/diary/2020/importanceOfPersistence/

25/02/2020

How to Sync Microsoft OneDrive with Linux

https://www.maketecheasier.com/sync-onedrive-linux/

21/02/2020

COinS for Your Jekyll Blog

https://matthewlincoln.net/2014/03/15/coins-for-your-jekyll-blog.html

17/02/2020

Persistent identifiers for heritage objects

https://journal.code4lib.org/articles/14978

15/02/2020

Google Webfonts Helper

https://google-webfonts-helper.herokuapp.com/fonts

14/02/2020

Notes on the Troubleshooting and Repair of Compact Disc Players and CDROM Drives

https://www.repairfaq.org/sam/cdfaq.htm

Check items under "Intermittent or erratic operation" and "Operation is poor or erratic when cold".

NAD CD player repair video

https://www.youtube.com/watch?v=jAehSoTmLGY

12/02/2020

Jekyll without plugins

https://jekyllcodex.org/without-plugins/

DLF Levels of Born-Digital Access

https://osf.io/af4eq/

04/02/2020

Firefox web archives add-on

https://github.com/dessant/web-archives

28/01/2020

Accessing Digital Archives Guide, UNC Library

https://guides.lib.unc.edu/accessdigitalarchives

Geolocate URL

Command-line:

https://www.maketecheasier.com/ip-address-geolocation-lookups-linux/

Python:

https://pypi.org/project/geoip2/

Uses MaxMind databases.

BUT getting IP address from URL is difficult in Python, so perhaps better to use bash:

https://linuxhandbook.com/find-website-ip-address-linux/
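
For the lookup step itself the Python standard library may be enough. A small sketch, assuming the URL includes a scheme (http:// or https://); the function names are made up, and the resulting addresses would still need to go through geoip2 for the actual geolocation:

```python
import socket
from urllib.parse import urlsplit


def hostname_from_url(url):
    """Return the host part of a URL (no scheme, port, or path)."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("no host name found in %r" % url)
    return host


def ip_addresses(url):
    """Resolve the URL's host name; real host names need a working DNS lookup."""
    infos = socket.getaddrinfo(hostname_from_url(url), None)
    # Each entry ends with a socket address tuple whose first item is the IP
    return sorted({info[4][0] for info in infos})
```

Note that one host name can resolve to several addresses (IPv4 and IPv6), so the result is a list.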

Windows registry code for Pandoc context menu item

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\*\shell\mkd2doc]
[HKEY_CLASSES_ROOT\*\shell\mkd2doc\command]
@="\"F:\\Pandoc\\pandoc.exe\" -s -S --ascii -N --toc-depth=2 \"%1\" -o \"%1.docx\""

Then save as pandoc.reg.

22/01/2020

Changed behaviour of Python collections in Python 3.8

This may be relevant to Iromlab or OmSipCreator:

https://docs.python.org/3/whatsnew/3.8.html#collections

Example:

https://github.com/kieranjol/IFIscripts/commit/c6eedd9ec0821b7108f7a93f81bf043a6cb53d20

(Via Twitter)

18/01/2020

PinePhone

https://en.wikipedia.org/wiki/PinePhone

16/01/2020

Everything I know about SSDs

http://kcall.co.uk/ssd/index.html

Task failed successfully pin

https://www.hellovoid.online/product/task-failed-successfully-enamel-pin-pre-order

09/01/2020

Low disk space on boot partition

https://forums.linuxmint.com/viewtopic.php?t=265077

Solved by running following codeblock (as described here):

OLDCONF=$(dpkg -l|grep "^rc"|awk '{print $2}')
CURKERNEL=$(uname -r|sed 's/-*[a-z]//g'|sed 's/-386//g')
LINUXPKG="linux-(image|headers|ubuntu-modules|restricted-modules)"
METALINUXPKG="linux-(image|headers|restricted-modules)-(generic|i386|server|common|rt|xen)"
OLDKERNELS=$(dpkg -l|awk '{print $2}'|grep -E $LINUXPKG |grep -vE $METALINUXPKG|grep -v $CURKERNEL)
YELLOW="\033[1;33m"
RED="\033[0;31m"
ENDCOLOR="\033[0m"
sudo apt-get purge $OLDKERNELS

Update: latest Mint releases can do this automatically. Open Update Manager, Preferences / Automation; check "Remove obsolete kernels and dependencies". See also here.

24/12/2019

The 2010s were supposed to bring the ebook revolution. It never quite came.

https://www.vox.com/culture/2019/12/23/20991659/ebook-amazon-kindle-ereader-department-of-justice-publishing-lawsuit-apple-ipad

15/12/2019

Microsoft Access: The Database Software That Won’t Die

https://medium.com/young-coder/microsoft-access-the-zombie-database-software-that-wont-die-5b09e389c166

12/12/2019

On Implementation of Open Standards in Software: To What Extent Can ISO Standards be Implemented in Open Source Software?

Some interesting observations on JPEG 2000:

http://www.diva-portal.org/smash/get/diva2:925474/FULLTEXT01.pdf

12/11/2019

Search Github gists by user

curl user:bitsgalore

04/11/2019

Two New Tools that Tame the Treachery of Files

https://blog.trailofbits.com/2019/11/01/two-new-tools-that-tame-the-treachery-of-files/

02/11/2019

EML attachments in O365 - a recipe for phishing

https://isc.sans.edu/forums/diary/EML+attachments+in+O365+a+recipe+for+phishing/25474/

01/11/2019

xkcd Earth Temperature Timeline

https://xkcd.com/1732/

31/10/2019

Manage Docker as a non-root user

https://docs.docker.com/install/linux/linux-postinstall/

30/10/2019

Linked multisession discs (CD-ROM)

http://www.gburner.com/online-help/what-is-multisession-disc.htm

"When you add more files in a subsequent session, a complete new file system is written for the new session, but it can include references to files recorded in the previous session; this is known as linked multisession."

History:

https://web.archive.org/web/20050211005128/http://www.roxio.com/en/support/cdr/multisessionhistory.html

28/10/2019

KPN Secure File Transfer

https://filetransfer.kpn.com/

23/10/2019

Location for AppImage files

Official recommendation is to use a folder in the home directory (see https://askubuntu.com/questions/1092742/where-should-i-put-appimages-files), but since the homedir on my home PC is on a slow HD whereas the OS + all other software is on a fast SSD, I created a directory under root:

/Applications/

Then move AppImage files there.

16/10/2019

List of web archives

https://erichennekam.blogspot.com/2014/07/lijst-webarchieven-in-de-wereld-want.html

14/10/2019

Levels of Born-Digital Access

https://docs.google.com/document/d/1N1fG4AgyBEJISc3tk5rWAc_3ZYdDbdVK4_Dbi_TusYQ/edit

13/10/2019

Computer Files Are Going Extinct

https://onezero.medium.com/the-death-of-the-computer-file-doc-43cb028c0506

08/10/2019

Why most academic journals are following outdated publishing practices

https://blog.scholasticahq.com/post/why-academic-journals-are-following-outdated-publishing-practices/

04/10/2019

Running Iromlab wrapped commands manually

For testing only:

C:\Users\jkn010\AppData\Roaming\Python\Python36\site-packages\iromlab\tools\libcdio\win64\cd-info.exe -C H: --no-header --no-device-info --no-disc-mode --no-cddb --dvd > cd-info.log

"C:\Program Files\dBpoweramp\BatchRipper\Loaders\Nimbie\Pre-Batch\Pre-Batch.exe" --drive="H"  --logfile="prebatch.log" --passerrorsback="prebatcherrors.log"

"C:\Program Files\dBpoweramp\BatchRipper\Loaders\Nimbie\Load\Load.exe" --drive="H" --rejectifnodisc  --logfile="load.log" --passerrorsback="loaderrors.log"

"C:\Program Files (x86)\Smart Projects\IsoBuster\IsoBuster.exe" /d:H: /ei:test-h.iso /et:u /ep:oea /ep:npc /c /m /nosplash /s:1 /l:ib-h.log

01/10/2019

Software setup for Device Side FC5025 floppy controller, Linux

  1. Compile and install the software according to official documentation

  2. In file /etc/udev/rules.d/025_fc5025.rules, replace the two occurrences of SYSFS with ATTRS

  3. Run:

    sudo usermod -a -G floppy $USER

  4. Reboot the machine

Tested with Linux Mint 18.3 (Sylvia), equivalent to Ubuntu Xenial.

Sources: https://groups.google.com/forum/#!topic/bitcurator-users/K1BPIbdKoOY/discussion + email correspondence with Device Side Data (the creator of the FC5025).

28/09/2019

OfficeToPDF

OfficeToPDF is a command line utility that converts Microsoft Office 2003, 2007, 2010, 2013 and 2016 documents from their native format into PDF using Office's in-built PDF export features.

https://github.com/cognidox/OfficeToPDF

27/09/2019

QEMU QED

"ffmprovisr for QEMU":

https://eaasi.gitlab.io/qemu-qed/

25/09/2019

OpenShot video editor

https://www.openshot.org/

(Used this for iPRES video)

Kdenlive video editor

https://kdenlive.org/en/

(Used this for earlier video, I think).

Copy of Apache-related files on Linux machine

Directories /etc/apache2, /var/www and file /etc/hosts copied to folder backup-webserver on backup disk BAKWA. Copied using:

  • sudo rsync -avhl /var/www/ ./var/www

  • sudo rsync -avhl /etc/apache2/ ./etc/apache2

  • sudo rsync -avhl /etc/hosts ./etc/

To be restored after reinstall.

ATA Secure Erase (erase SSD disk)

https://ata.wiki.kernel.org/index.php/ATA_Secure_Erase

23/09/2019

MIT Digital Media Transfer Kits

https://libguides.mit.edu/digmediatransfer

How to Sync Microsoft OneDrive with Linux

https://www.maketecheasier.com/sync-onedrive-linux/

20/09/2019

U.S. National Archives Digital Preservation Framework

https://github.com/usnationalarchives/digital-preservation

15/09/2019

Learning Machine Learning

https://cloud.google.com/products/ai/ml-comic-1/

11/09/2019

DiscImageCreator

https://github.com/saramibreak/DiscImageCreator

(via Twitter)

06/09/2019

Appendix A: Tables of File Formats | National Archives

https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html

27/08/2019

Microservices in Audiovisual Archives

This document describes and examines strategies for designing lightweight microservice environments for the processing of digital, file-based, audiovisual data within an archive.

http://journal.iasa-web.org/pubs/article/view/70

22/08/2019

Fix Bless "not enough free space on the device to save file" errors

  1. Close Bless, and open preferences file (/home/johan/.config/bless/preferences.xml) in a text editor.
  2. Set temp dir by editing pref element with ByteBuffer.TempDir name attribute
  3. Add closing </preferences> tag and save the file. File should look like below:
    <preferences>
        <pref name="ByteBuffer.TempDir">/tmp/Bless</pref>
        <pref name="Default.NumberBase">Hexadecimal</pref>
        <pref name="Undo.Actions">100</pref>
        <pref name="View.Toolbar.Show">True</pref>
        <pref name="Undo.Limited">False</pref>
        <pref name="View.Statusbar.Show">True</pref>
        <pref name="Session.RememberWindowGeometry">True</pref>
        <pref name="Default.Layout.UseCurrent">False</pref>
        <pref name="Session.RememberCursorPosition">True</pref>
        <pref name="Session.AskBeforeLoading">False</pref>
        <pref name="View.Statusbar.Selection">True</pref>
        <pref name="Tools.Statistics.Show">False</pref>
        <pref name="View.Statusbar.Offset">True</pref>
        <pref name="Tools.ConversionTable.LEDecoding">False</pref>
        <pref name="Default.EditMode">Insert</pref>
        <pref name="Tools.ConversionTable.Show">True</pref>
        <pref name="Highlight.PatternMatch">True</pref>
        <pref name="Undo.KeepAfterSave">Memory</pref>
        <pref name="Session.LoadPrevious">True</pref>
        <pref name="View.Statusbar.Overwrite">True</pref>
        <pref name="Default.Layout.File">
    </preferences>
  4. Make the file read-only:
    chmod 0444 /home/johan/.config/bless/preferences.xml
    

Done!

Source here

Update: this didn't quite work, but a workaround is to enter the location of the temp dir (/tmp/Bless) directly in Bless' user interface as a text string (so don't use the file navigation widgets!).

16/08/2019

Philology and the digital writing process

https://filologiaunlp.files.wordpress.com/2018/06/ries_philology-and-the-digital-writing-process_2017.pdf

14/08/2019

JP2 images in Tika regression corpus

http://162.242.228.174/share/jp2.tgz

13/08/2019

Going Commando - Put Down The Mouse

https://blog.codinghorror.com/going-commando-put-down-the-mouse/

Mouseless Computing

https://weblogs.asp.net/jongalloway/Mouseless-Computing

Hack Attack: Mouse-less Firefox

https://lifehacker.com/hack-attack-mouse-less-firefox-139495

09/08/2019

Python reverse geocode

Reverse Geocode takes a latitude / longitude coordinate and returns the country and city.

https://pypi.org/project/reverse-geocode/

03/08/2019

"Verloren jouw gegevens" ("Lost your data")

Source: https://twitter.com/Eijsbouts/status/1157591377624150016

31/07/2019

1995: a quarter of large companies on the Internet

https://twitter.com/rutger_/status/1156629656533110787 (archived)

Delpher link: https://resolver.kb.nl/resolve?urn=ABCDDD:010870971:mpeg21:a0117

Use as context for xxLINK presentation!

29/07/2019

Install Android on VirtualBox

https://www.howtogeek.com/164570/how-to-install-android-in-virtualbox/

Then in VirtualBox change display option "Graphics Controller" to VBoxVGA, and enable 3D acceleration, as per here.

Home Assistant

https://www.home-assistant.io/

27/07/2019

Renoise audio configuration

Added following lines to /etc/security/limits.conf, as per here:

johan - rtprio 99
johan - nice -10

11/07/2019

deja-dup / duplicity keeps asking for encryption password

See:

https://askubuntu.com/questions/462085/deja-dup-repeatedly-asks-encryption-password

Tried:

  • Re-install of duplicity
  • Changed ownership of a few dirs in home that were owned by root.

Start backup from terminal:

export DEJA_DUP_DEBUG=1
deja-dup --backup

Result: backup appears to be created, but after verification stage deja-dup asks for password again. Tail end of debug output:

DUPLICITY: .     self.gpg_failed()
DUPLICITY: .   File "/usr/lib/python2.7/dist-packages/duplicity/gpg.py", line 272, in gpg_failed
DUPLICITY: .     raise GPGError(msg)
DUPLICITY: .  GPGError: GPG Failed, see log below:
DUPLICITY: . ===== Begin GnuPG log =====
DUPLICITY: . gpg: WARNING: "--no-use-agent" is an obsolete option - it has no effect
DUPLICITY: . gpg: AES256 encrypted data
DUPLICITY: . gpg: encrypted with 1 passphrase
DUPLICITY: . gpg: decryption failed: Bad session key
DUPLICITY: . ===== End GnuPG log =====
DUPLICITY: . 
DUPLICITY: . 

DUPLICITY: ERROR 31 GPGError
DUPLICITY: . GPGError: GPG Failed, see log below:
DUPLICITY: . ===== Begin GnuPG log =====
DUPLICITY: . gpg: WARNING: "--no-use-agent" is an obsolete option - it has no effect
DUPLICITY: . gpg: AES256 encrypted data
DUPLICITY: . gpg: encrypted with 1 passphrase
DUPLICITY: . gpg: decryption failed: Bad session key
DUPLICITY: . ===== End GnuPG log =====
DUPLICITY: . 

10/07/2019

nwipe - securely erase disks (dban fork)

https://linux.die.net/man/1/nwipe

08/07/2019

Archaeology of the Amsterdam digital city; why digital data are dynamic and should be treated accordingly

https://www.tandfonline.com/doi/full/10.1080/24701475.2017.1309852

02/07/2019

Toward Environmentally Sustainable Digital Preservation

https://dash.harvard.edu/handle/1/40741399

25/06/2019

Deja-dup filling up home dir

After attaching a large external HD + including it in the backup scheme, deja-dup eats up all space of main HD. Cause: deja-dup writes some metadata and manifest files to home dir at:

~/.cache/deja-dup/

These files become very large (here: > 18 GB), which results in running out of disk space. Apparently this causes problems for lots of deja-dup users, e.g. here, here. This post suggests solving this by replacing ~/.cache/deja-dup/ with a symlink to a directory on another disk with sufficient space:

mkdir /media/johan/BAKWA/.deja-dup-cache
mv ~/.cache/deja-dup/* /media/johan/BAKWA/.deja-dup-cache/
rmdir ~/.cache/deja-dup
ln -sf /media/johan/BAKWA/.deja-dup-cache ~/.cache/deja-dup

UPDATE: doesn't work, files are still written to home dir!! Interim solution: exclude external drive from deja-dup backup scheme, and back it up manually with rsync (no incremental backup though!).

20/06/2019

Format USB drive as ext4

List partitions:

df -h

Result:

Filesystem      Size  Used Avail Use% Mounted on
udev            3,9G     0  3,9G   0% /dev
tmpfs           789M  9,5M  780M   2% /run
/dev/sda1       227G  202G   14G  94% /
tmpfs           3,9G   34M  3,9G   1% /dev/shm
tmpfs           5,0M  4,0K  5,0M   1% /run/lock
tmpfs           3,9G     0  3,9G   0% /sys/fs/cgroup
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           789M   32K  789M   1% /run/user/1000
/dev/sdb1       1,9T  144M  1,9T   1% /media/johan/Elements4

So in this case we need to format /dev/sdb1. Unmount the disk:

sudo umount /dev/sdb1

Format as ext4:

sudo mkfs.ext4 /dev/sdb1

Change generic label to WEBARCH:

sudo e2label /dev/sdb1 WEBARCH

Done!

Copy directory tree with rsync

#!/bin/bash
# Script must be run as root!

sourceDir=/media/johan/Elements4/webarcheologie
destDir=/media/johan/WEBARCH/
rsync -avhl --dry-run $sourceDir $destDir

Copy homedir:

#!/bin/bash
# Script must be run as root!

sourceDir=~
destDir=/media/johan/BAKWA/homedir-25022020/
rsync -avhl $sourceDir $destDir

17/06/2019

Filesystem Hierarchy Standard

https://www.linuxjournal.com/content/filesystem-hierarchy-standard

11/06/2019

Researcher, Don’t Make Your Readers Scream!

https://www.cl.cam.ac.uk/~lp15/Pages/Scream.html

07/06/2019

Quick MAME/MESS Philips CD-I Tutorial (Mame 0.172)

https://forums.launchbox-app.com/topic/29631-quick-mamemess-philips-cd-i-tutorial-mame-0-172/

30/05/2019

Reader Privacy: The New Shape of the Threat (Clifford Lynch)

https://publications.arl.org/16ivjbv/ (PDF link)

27/05/2019

LaTeX setup notes

First install the following packages:

sudo apt install texlive-latex-extra
sudo apt-get install texlive-bibtex-extra biber
sudo apt-get install texlive-fonts-recommended

Then download the OpenSans package here. Install using following steps:

  1. Copy doc/, fonts/, source/, and tex/ directories to /etc/texmf directory
  2. Run mktexlsr to refresh the file name database and make TeX aware of the new files.
  3. Run sudo updmap -sys --enable Map=opensans.map to make Dvips, dvipdf and pdfTeX aware of the new fonts.

26/05/2019

Digital Physical Carrier Illustrations

https://blog.matthewburgess.net/2019/05/digital-physical-carrier-illustrations.html

22/05/2019

Corrupt a file - The file corrupter you were looking for!

https://corrupt-a-file.net/

18/05/2019

Manuals HP ProDesk

https://support.hp.com/us-en/product/hp-prodesk-400-g3-microtower-pc/7638325/manuals

16/05/2019

Regex to convert smart quotes with regular ones (and vice-versa)

https://gist.github.com/zerolab/1633661

Convert dumb quotes to smart quotes in Python

https://gist.github.com/davidtheclark/5521432

Even easier, use SmartyPants:

https://pypi.org/project/smartypants/
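
The opposite direction (smart quotes to regular ones) needs no regex or extra package. A minimal sketch that only covers the four most common curly quote characters; the names SMART_TO_PLAIN and straighten_quotes are my own:

```python
# Map curly quote characters to their plain ASCII counterparts
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Replace curly quotes in `text` with straight ones."""
    return text.translate(SMART_TO_PLAIN)
```

Other typographic characters (low quotes, guillemets, primes) would need their own entries in the table.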

09/05/2019

Library of Congress Web Archive Data Sets

https://labs.loc.gov/experiments/webarchive-datasets/

01/05/2019

Unraveling the JPEG

https://parametric.press/issue-01/unraveling-the-jpeg/

20/04/2019

Floppy disks are like Jesus

15/04/2019

ArchiveBox

ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).

https://archivebox.io/

02/04/2019

Text in PDF has no Unicode mapping

Short of AI, your best bet is to run OCR (tesseract) on these files.

https://lists.apache.org/thread.html/d25f20eda1c2094f0902e7b7092d829a64085b3b87aad2b8b346a453@%3Cuser.tika.apache.org%3E

23/03/2019

Identification of audio CD on Linux

Use cd-discid:

cd-discid /dev/sr1

Result:

b608ed0f 15 150 8656 19406 37656 48025 58358 71683 77998 90546 103443 117153 120751 132154 144223 157688 2287

Lookup in freedb using:

http://freedb.freedb.org/~cddb/cddb.cgi?cmd=cddb+query+b608ed0f+15+150+8656+19406+37656+48025+58358+71683+77998+90546+103443+117153+120751+132154+144223+157688+2287&hello=user+hostname+program+version&proto=3

Result:

200 rock b608ed0f Der Plan / Unkapitulierbar

Full record:

http://www.freedb.org/freedb/rock/b608ed0f

# xmcd
#
# Track frame offsets: 
#        150
#        8656
#        19406
#        37656
#        48025
#        58358
#        71683
#        77998
#        90546
#        103443
#        117153
#        120751
#        132154
#        144223
#        157688
#
# Disc length: 2287 seconds
#
# Revision: 0
# Processed by: cddbd v1.5.2PL0 Copyright (c) Steve Scherf et al.
# Submitted via: ExactAudioCopy v0.99pb5
#
DISCID=b608ed0f
DTITLE=Der Plan / Unkapitulierbar
DYEAR=2017
DGENRE=Electronic
TTITLE0=Wie der Wind weht
TTITLE1=Lass die Katze stehn!
TTITLE2=Man leidet herrlich
TTITLE3=Grundrecht
TTITLE4=Es heißt: die Sonne
TTITLE5=Gesicht ohne Buch
TTITLE6=Stille hören
TTITLE7=Flohmarkt der Gefühle
TTITLE8=Der Herbst
TTITLE9=Körperlos im Cyberspace
TTITLE10=Zu Besuch bei N. Senada
TTITLE11=Wie schwarz ist ein Rabe?
TTITLE12=Come Fly With Me
TTITLE13=Was kostet der Austritt?
TTITLE14=Die Hände des Astronauten
EXTD=
EXTT0=
EXTT1=
EXTT2=
EXTT3=
EXTT4=
EXTT5=
EXTT6=
EXTT7=
EXTT8=
EXTT9=
EXTT10=
EXTT11=
EXTT12=
EXTT13=
EXTT14=
PLAYORDER=

Python: cddb-py; Python 3 port here.

See also CDDB.
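
The disc ID reported by cd-discid can be recomputed from the other fields on the same line, which is a handy sanity check. A sketch of the classic CDDB disc ID calculation as I understand it (sum of the decimal digits of each track's start time in seconds, total playing time, number of tracks); the function names are made up:

```python
def cddb_disc_id(offsets, length_seconds):
    """Compute the CDDB/freedb disc ID from track frame offsets.

    offsets: start of each track in frames (75 frames per second)
    length_seconds: total disc length in seconds, as printed by cd-discid
    """
    def digit_sum(n):
        return sum(int(d) for d in str(n))

    # Checksum over the start time (whole seconds) of every track
    checksum = sum(digit_sum(off // 75) for off in offsets)
    # Playing time is counted from the start of the first track
    playing_time = length_seconds - offsets[0] // 75
    disc_id = ((checksum % 255) << 24) | (playing_time << 8) | len(offsets)
    return "%08x" % disc_id


def parse_cd_discid(line):
    """Split one line of cd-discid output into ID, offsets and length."""
    fields = line.split()
    n_tracks = int(fields[1])
    offsets = [int(f) for f in fields[2:2 + n_tracks]]
    return fields[0], offsets, int(fields[2 + n_tracks])
```

Feeding the cd-discid output line shown above through parse_cd_discid and then cddb_disc_id gives back b608ed0f, matching the reported ID.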

08/03/2019

Update forked Git repository

From here:

git remote add upstream https://github.com/sluwesjaantje/slomeslager.git
git fetch upstream
git checkout main
git rebase upstream/main
git push -f origin main

06/03/2019

ExifTool: report custom image properties to CSV file

Suppose we want to extract the Jpeg2000:NumberOfComponents field for each JP2 image:

exiftool -csv -Jpeg2000:NumberOfComponents /media/johan/Elements4/test/*.jp2 > exif.csv

Result:

SourceFile,NumberOfComponents
/media/johan/Elements4/test/HS-19640508-001.jp2,3
/media/johan/Elements4/test/HS-19640508-002.jp2,3
::
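
The resulting CSV is easy to summarise in Python, e.g. to see how many images have 1, 3 or 4 components. A small sketch, assuming the column names shown above; count_components is a made-up name:

```python
import csv
from collections import Counter


def count_components(csv_file):
    """Count how often each NumberOfComponents value occurs."""
    reader = csv.DictReader(csv_file)
    # Values are kept as strings, exactly as they appear in the CSV
    return Counter(row["NumberOfComponents"] for row in reader)
```

Typical use would be something like: with open("exif.csv", newline="") as f: print(count_components(f)).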

05/03/2019

ImageMagick: resize all images in directory to fixed width

mogrify -resize 1014 *.jpg

(Note: this changes the images in-place, so make a copy of the original images before doing this).

12/02/2019

ImageMagick: fix 'convert: not authorized' on PDF

https://alexvanderbist.com/posts/2018/fixing-imagick-error-unauthorized

10/02/2019

Emulation resources list (Ethan Gates)

https://github.com/EG-tech/emulation-resources

29/01/2019

Big List of Naughty Strings

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.

https://github.com/minimaxir/big-list-of-naughty-strings

03/01/2019

Twitter search advanced guide

https://espirian.co.uk/twitter-search-advanced-guide/

22/12/2018

Mounting Fritz.NAS under Linux Mint

Below instructions are for a fresh install. Based on:

https://dominicpratt.de/fritz-nas-unter-debianubuntu-einbinden/

  1. Open fstab in text editor as sudo:

    sudo xed /etc/fstab
    
  2. Add following line to bottom (use vers=2.0 from FritzOS 7.21 onward; also last line of file must be empty):

    //192.168.178.1/FRITZ.NAS /media/fritzbox cifs credentials=/etc/samba/auth,vers=2.0,uid=1000,gid=1000 0
    
  3. Create the mount directory:

    sudo mkdir -p /media/fritzbox 
    
  4. Create file /etc/samba/auth:

    sudo touch /etc/samba/auth
    
  5. Edit as sudo:

    sudo xed  /etc/samba/auth
    
  6. Add username and password entries (must be FritzNAS uname + pwd, not the FritzBox ones!):

    username=johan
    password=dfh3476fh8((77&&
    
  7. Install the samba package (cifs-utils is also needed, but that is already part of the default Linux Mint install):

    sudo apt install samba 
    
  8. Finally mount:

    sudo mount -a
    

Done!

21/12/2018

Linux Mint: new install results in Grub Prompt when booting

https://forums.linuxmint.com/viewtopic.php?t=217509

Deark

A utility for file format and metadata analysis, data extraction, and image format decoding

https://github.com/jsummers/deark

24/192 Music Downloads ... and why they make no sense

https://people.xiph.org/~xiphmont/demo/neil-young.html

30/11/2018

mh virtual tape & library system.

https://github.com/markh794/mhvtl

Install script for Ubuntu 16.04:

https://gist.github.com/hrchu/3eb1c0aa9994df0328037fff04cd889d

Then run using:

sudo /etc/init.d/mhvtl start

24/11/2018

Tkinter bitmaps in Ubuntu (Python)

https://stackoverflow.com/a/25223352/1209004

E.g.:

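import os
import tkinter as tk

def main():
    """Main function"""

    # get_main_dir() and tapeimgrGUI are defined elsewhere in the application
    appDir = get_main_dir()
    root = tk.Tk()
    # Set the window icon from a PNG file in the application directory
    root.iconphoto(True, tk.PhotoImage(file=os.path.join(appDir, 'icon.png')))
    myGUI = tapeimgrGUI(root)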

24/10/2018

Bash: output to array, which is then parsed

# Get tape status, output to array (split at newline)
IFS=$'\n' tapeStatus=$(mt -f $TAPEnr status)

# Parse file number and block number from status output 
for item in ${tapeStatus[*]}
do
    if [[ $item == *"file number"* ]]; then
        # Split at equal sign, 2nd item is value
        tmp=$(echo $item | cut -f2 -d=)
        # Strip whitespace
        fileNumber="$(echo -e "${tmp}" | tr -d '[:space:]')"
        #echo $fileNumber
    fi

    if [[ $item == *"block number"* ]]; then
        # Split at equal sign, 2nd item is value
        tmp=$(echo $item | cut -f2 -d=)
        # Strip whitespace
        blockNumber="$(echo -e "${tmp}" | tr -d '[:space:]')"
        #echo $blockNumber
    fi

done
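
For comparison, the same parsing is a bit less fiddly in Python. A sketch only, assuming the status output has one "name = value" pair per line (as the Bash code above expects); the function name parse_tape_status is made up:

```python
def parse_tape_status(status_text):
    """Return file and block number found in `mt status` output."""
    values = {}
    for line in status_text.splitlines():
        # Split at the first equal sign; lines without one are skipped
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return int(values["file number"]), int(values["block number"])
```

The status text itself could be captured with subprocess.run(["mt", "-f", device, "status"], capture_output=True, text=True).stdout, where device is the tape device path.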

20/10/2018

Oxford Common File Layout

This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.

https://ocfl.io/

18/10/2018

Hex Editing for Archivists

http://www.av-rd.com/knowhow/

12/10/2018

Camelot: PDF Table Extraction for Humans (Python)

https://github.com/socialcopsdev/camelot/

01/10/2018

Update nodejs

https://askubuntu.com/questions/711834/unable-to-update-node-js-keeps-returning-to-old-version-ubuntu-15-04

try dat

https://try-dat.com/

25/09/2018

ReMarkable MarkDown editor

https://remarkableapp.github.io/index.html

Preservation Planning for Emerging Formats at the British Library

https://osf.io/65p7m/

14/09/2018

Docker files consume excessive amounts of disk space

See also moby/moby#21925.

E.g.:

sudo du -hx --max-depth=1 /var/lib

Result contains this entry:

25G	/var/lib/docker

There are probably more elegant/subtle ways to handle this, see e.g. https://lebkowski.name/docker-volumes/

Solution/workaround

Uninstall docker:

sudo apt-get remove docker docker-engine docker.io

Delete files:

sudo rm -rf /var/lib/docker

10/09/2018

BL Emerging Formats project

The Library’s ‘Emerging Formats’ project is focused on UK publications created for the mobile web, as interactive narratives or in database format.

https://britishlibrary.recruitment.northgatearinso.com/birl/pages/vacancy.jsf?latest=01001612

Caylin Smith and Ian Cooke report on the Emerging Formats project, which is investigating the collection management needs of published works that are created with digital formats that have significant software and hardware dependencies. They discuss the collection management challenges of these format types within the framework of UK NPLD.

http://journals.sagepub.com/doi/full/10.1177/0955749018785836

24/08/2018

Empty Trash on Linux machine from terminal

This works if Trash contains items that were put there as superuser:

sudo rm -rf ~/.local/share/Trash/*

16/08/2018

How To Install WordPress with LAMP on Ubuntu 16.04

https://www.digitalocean.com/community/tutorials/how-to-install-wordpress-with-lamp-on-ubuntu-16-04

Use this to import kbresearch blog; then export to static site using:

https://wordpress.org/plugins/static-html-output-plugin/

06/08/2018

Digital transformation at Wellcome Collection

https://stacks.wellcomecollection.org/digital-transformation-at-wellcome-collection-639fb177aad6

27/07/2018

Search by file extension on Github

filename:ext extension:ext where ext is the extension you're interested in. You need both the filename and extension keywords to filter it down to only potential files of interest.

https://twitter.com/NKrabben/status/1022575556209074220

Example:

https://github.com/search?q=filename%3Awq1+extension%3Awq1
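
A minimal sketch of how such a search URL could be built for any extension. The function name github_extension_search_url is hypothetical; the URL pattern is simply copied from the example above.

```python
from urllib.parse import urlencode


def github_extension_search_url(ext):
    """Build a GitHub search URL that combines filename: and extension:."""
    query = "filename:{0} extension:{0}".format(ext)
    return "https://github.com/search?" + urlencode({"q": query})


print(github_extension_search_url("wq1"))
```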

26/07/2018

Smallest possible […] file

This repository aims to collect the smallest possible syntactically valid files in different programming/scripting/markup languages.

https://github.com/mathiasbynens/small

25/07/2018

VisiData

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

http://visidata.org/

09/07/2018

Disk wiping and data forensics: Separating myth from science

https://www.techrepublic.com/article/disk-wiping-and-data-forensics-separating-myth-from-science/

30/06/2018

Excel Unusual

the home of the most unique Microsoft Excel animated spreadsheets

http://www.excelunusual.com/

29/06/2018

Hackmd.io

https://hackmd.io/

28/06/2018

It's Not Easy Being Green(e): Digital Preservation in the Age of Climate Change

https://scholarsphere.psu.edu/concern/generic_works/bvq27zn11p

23/06/2018

PREMIS/METS for scalability

https://wiki.archivematica.org/PREMIS/METS_for_scalability

17/06/2018

Markdown and Visual Studio Code

https://code.visualstudio.com/Docs/languages/markdown

Build an Amazing Markdown Editor Using Visual Studio Code and Pandoc

http://thisdavej.com/build-an-amazing-markdown-editor-using-visual-studio-code-and-pandoc/

15/06/2018

How to Measure Static Electricity

https://www.wikihow.com/Measure-Static-Electricity

11/06/2018

gedit on Windows

Install in MINGW:

pacman -S mingw-w64-x86_64-gedit

Add external plugin:

https://stackoverflow.com/questions/39360149/adding-external-plug-ins-to-gedit-in-windows

Get plugins here:

https://wiki.gnome.org/Apps/Gedit/ThirdPartyPlugins-v3.0

28/05/2018

Swisscows search engine

https://swisscows.com/

22/05/2018

Installation of Ace

If ELIFECYCLE / puppeteer error happens, try this (source):

sudo npm install @daisy/ace -g --unsafe-perm=true --allow-root

In case this results in:

sudo: npm: command not found

Then get location of npm:

which npm

Result:

/home/johan/.nvm/versions/node/v10.11.0/bin/npm

Create symbolic link:

sudo ln -s /home/johan/.nvm/versions/node/v10.11.0/bin/npm /usr/bin/npm

BUT ace now fails with:

Error: ENOENT: no such file or directory, mkdir '/home/johan/.local/state/DAISY Ace'

Fix: manually created directory "state/DAISY Ace" in ".local", now works!

20/05/2018

Proselint

Our goal is to aggregate knowledge about best practices in writing and to make that knowledge immediately accessible to all authors in the form of a linter for prose.

https://github.com/amperser/proselint/

18/05/2018

Memento Tracer

http://tracer.mementoweb.org/

17/05/2018

The Importance of EPUB and the Need for EPUB 4

https://w3c.github.io/publ-bg/docs/EPUB4_business_case.html

11/05/2018

What’s in a Name? On ‘Meaningfulness’ and Best Practices in Filenaming within the LAM Community

http://journal.code4lib.org/articles/13438

10/05/2018

Microsoft Office Supported File formats

Possibly more here.

07/05/2018

Ace Accessibility Checker for EPUB

https://daisy.github.io/ace/

Web service based on Ace:

http://bacc.dzb.de/

03/05/2018

List of open workflows and resources for A/V archiving

https://github.com/amiaopensource/open-workflows

26/04/2018

Integration of nonharvested web data into an existing web archive

http://netarkivet.dk/wp-content/uploads/IntegrationOfNonHarvestedData.pdf

Read Tape Contents (Linux)

https://www.linuxquestions.org/questions/linux-newbie-8/read-tape-contents-944371/

17/04/2018

Ten simple rules for structuring papers

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005619

12/04/2018

A successful Git branching model

http://nvie.com/posts/a-successful-git-branching-model/

Wikidata portal project

https://github.com/WikiDP/wikidp-portal

03/04/2018

Apache Web Server on Ubuntu 16.04

https://www.digitalocean.com/community/tutorials/how-to-install-the-apache-web-server-on-ubuntu-16-04

Restrict Apache Access to Localhost Only

In config file ports.conf, change this line:

Listen 80

into this:

Listen 127.0.0.1:80

See:

https://serverfault.com/questions/276963/make-apache-only-accessible-via-127-0-0-1-is-this-possible/276968#276968

Setting up multiple sites:

https://www.liberiangeek.net/2015/07/how-to-enable-and-run-multiple-websites-using-apache2-on-ubuntu-15-04/

28/03/2018

Script Ahoy

Community resource intended to provide helpful one-liners and script code specifically drawn from real-life examples in archives and libraries

https://dd388.github.io/crals/

Create static archived version of Wordpress blog

wget --recursive --no-clobber --span-hosts --page-requisites \
     --convert-links --no-parent -w 5 --random-wait \
     http://blog.kbresearch.nl >>wget.log 2>&1

This doesn't quite work the way it should:

  • If we leave out --span-hosts external stylesheets etc. are ignored, even if --page-requisites is used (don't want that)!
  • If we include --span-hosts externally referenced pages/sites are scraped as well (don't want that either!)

See also https://gist.github.com/dannguyen/03a10e850656577cfb57

Better approach:

  1. Scrape one single page:

    wget --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait http://blog.kbresearch.nl/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa/ >>$logFile 2>&1

This gives us the domains used for individual page resources, which we can subsequently feed into --domains. After some fiddling (we don't want to harvest more than 60 gravatar subdomains) this looks reasonable:

#!/bin/bash

url=http://blog.kbresearch.nl
domains=blog.kbresearch.nl,wp.com,researchkb.files.wordpress.com,googleapis.com,gstatic.com

logFile=wget.log
wget --mirror --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait --domains=$domains $url >>$logFile 2>&1

24/03/2018

Difficulties of Timestamping Archived Web Pages

https://arxiv.org/abs/1712.03140

22/03/2018

swMATH

swMATH is a freely accessible, innovative information service for mathematical software. swMATH not only provides access to an extensive database of information on mathematical software, but also includes a systematic linking of software packages with relevant mathematical publications.

http://www.swmath.org/

17/03/2018

Windows previous versions documentation

https://docs.microsoft.com/en-us/previous-versions/windows/

12/03/2018

Wikidata for digital preservation portal

http://wikidp.org/

09/03/2018

Search files in UK web archive by magic pattern

See this thread on digipres.club for some context:

https://digipres.club/@joe/99650486509645352

Search URL:

https://www.webarchive.org.uk/shine/search?page=1&invert=&facet.fields=crawl_year&invert=&invert=&facet.fields=public_suffix&invert=&invert=&invert=&invert=&action=search&query=content_ffb:%220baddeed%22&totalCount=totalCount&sort=crawl_date&order=asc

Is Open Science ready for software containers?

One of our goals is to publish researcher's data, code, and executable Linux container all as files in a version controlled Dat repository. For this to be useful, a person should be able to execute these Linux environments (aka containers) anywhere

https://blog.datproject.org/2018/01/26/challenges-of-decentralized-hpc-containerization/

07/03/2018

Install OwnCloud desktop on Linux Mint 18.3

Instructions here, Ubuntu 16.04.

If updating results in warnings about package authentication, follow steps below:

owncloud/client#5287 (comment)

06/03/2018

Remove all XMP tags from a TIFF, except xmp-tiff ones

exiftool -xmp:all= "-all:all<xmp-tiff:all" MMKB19_000004012_00002_master.tiff

27/02/2018

Set non-standard maximum line length in pep8

Use --max-line-length option, e.g.:

pep8 --max-line-length=120 ~/omSipCreator/omSipCreator > pep8.txt

16/02/2018

Longevity of Optical Disc Media: Accelerated Ageing Predictions and Natural Ageing Data

https://www.degruyter.com/view/j/rest.2017.38.issue-3/res-2016-0032/res-2016-0032.xml?format=INT

COMPACT DISC SERVICE LIFE: AN INVESTIGATION OF THE ESTIMATED SERVICE LIFE OF PRERECORDED COMPACT DISCS (CD-ROM)

https://www.loc.gov/preservation/resources/rt/CDservicelife_rev.pdf

CD-ROM Longevity Research at LoC

https://www.loc.gov/preservation/scientists/projects/cd_longevity.html

CD-R and DVD-R RW Longevity Research at LoC

https://www.loc.gov/preservation/scientists/projects/cd-r_dvd-r_rw_longevity.html

15/02/2018

Python Macros in OpenOffice / LibreOffice

http://christopher5106.github.io/office/2015/12/06/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html

14/02/2018

Write Markdown with 8 Exceptional Open Source Editors

https://www.ossblog.org/markdown-editors/

06/02/2018

Discard unstaged changes to Git repo

git restore .

(see also stackoverflow)

05/02/2018

PREMIS in METS Toolbox

validate METS file against best practices:

http://pim.fcla.edu/validate

Schematron rules:

http://pim.fcla.edu/resources

31/01/2018

Siegfried format counts

sf -csv t/images | cut -d ',' -f 6 | sort | uniq -c | sort -r

Result:

  8 x-fmt/390
  7 fmt/645
  5 fmt/41
  5 fmt/101
  4 fmt/43
  3 x-fmt/62
  3 x-fmt/263
  3 x-fmt/111
  3 fmt/44
  2 fmt/661
  2 fmt/5
  2 fmt/17
 28 UNKNOWN
  1 x-fmt/92
  ::
  etc

(Source: Nick Krabbenhöft)
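
A rough Python equivalent of the pipeline above, as a sketch only. It assumes that the sixth column of Siegfried's CSV output has the header "id"; the sample rows are made up. Unlike "sort -r", which sorts the lines as text, most_common() orders by the numeric count.

```python
import csv
import io
from collections import Counter


def count_format_ids(csv_file):
    """Count how often each value of the 'id' column occurs."""
    reader = csv.DictReader(csv_file)
    return Counter(row["id"] for row in reader)


# Made-up CSV rows for illustration; real input would come from 'sf -csv'
sample = io.StringIO(
    "filename,filesize,modified,errors,namespace,id\n"
    "a.jpg,100,2018-01-31,,pronom,fmt/43\n"
    "b.jpg,200,2018-01-31,,pronom,fmt/43\n"
    "c.bin,300,2018-01-31,,pronom,UNKNOWN\n"
)
for format_id, count in count_format_ids(sample).most_common():
    print(count, format_id)
```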

How to update a GitHub forked repository

https://stackoverflow.com/a/7244456

Create Windows context menu item

https://gist.github.com/bitsgalore/7c5da72277557b608c94

ExifTool sample files

https://sourceforge.net/p/exiftool/code/ci/master/tree/t/images/

Wine installation on Linux Mint 18.3

Not working, problem seems to correspond to issue here:

https://forums.linuxmint.com/viewtopic.php?f=47&t=260925

24/01/2018

Finding and installing packages in MSYS2

Create/update package database:

pacman -Fy

Result:

:: Synchronizing package databases...
 mingw32                    2.4 MiB  2.97M/s 00:01 [#####################] 100%
 mingw32.sig               96.0   B  0.00B/s 00:00 [#####################] 100%
 mingw64                    2.4 MiB  1695K/s 00:01 [#####################] 100%
 mingw64.sig               96.0   B  0.00B/s 00:00 [#####################] 100%
 msys                     855.8 KiB  4.24M/s 00:00 [#####################] 100%
 msys.sig                  96.0   B  0.00B/s 00:00 [#####################] 100%

Find package name from (sub) string:

pacman -Fsx iso-info

Result:

mingw32/mingw-w64-i686-libcdio 2.0.0-1
    mingw32/bin/iso-info.exe
    mingw32/share/man/man1/iso-info.1.gz
mingw64/mingw-w64-x86_64-libcdio 2.0.0-1
    mingw64/bin/iso-info.exe
    mingw64/share/man/man1/iso-info.1.gz

Install package:

pacman -S mingw-w64-x86_64-libcdio

Uninstall package:

pacman -R mingw-w64-x86_64-libcdio

Source: https://github.com/msys2/msys2/wiki/Using-packages

23/01/2018

SRU: select Mac-only CD-ROMs

Query:

extent any "cdrom* cd-rom*" and annotation any "Mac*" not annotation any "Win* PC*"

Result:

http://www.kbresearch.nl/tpxslt/?xml=http://jsru.kb.nl/sru/sru?query=extent%20any%20%22cdrom*%20cd-rom*%22%20and%20annotation%20any%20%22Mac*%22%20not%20annotation%20any%20%22Win*%20PC*%22&x-collection=GGC&maximumRecords=10&xsl=http://www.kbresearch.nl/xportal/brief.xsl

SRU: select Blu-Ray discs

Query:

extent any "blu*"

Result (only 5 hits, 23/1/2018):

http://www.kbresearch.nl/tpxslt/?xml=http://jsru.kb.nl/sru/sru?query=extent%20any%20%22blu*%22&x-collection=GGC&maximumRecords=10&xsl=http://www.kbresearch.nl/xportal/brief.xsl
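
A minimal sketch of how the SRU part of these URLs could be generated from a query string. The function name sru_url is hypothetical; the endpoint and parameters are copied from the URLs above, and the tpxslt wrapper with its xsl parameter is left out.

```python
from urllib.parse import quote

SRU_BASE = "http://jsru.kb.nl/sru/sru"  # endpoint taken from the URLs above


def sru_url(query, collection="GGC", maximum_records=10):
    """Build an SRU request URL with a percent-encoded query."""
    # Keep '*' unescaped so the result matches the URLs in these notes
    encoded = quote(query, safe="*")
    return "{}?query={}&x-collection={}&maximumRecords={}".format(
        SRU_BASE, encoded, collection, maximum_records)


print(sru_url('extent any "blu*"'))
```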

18/12/2017

List contents of ISO image with 7-zip

Command:

7z l -slt iso9660.iso

Result:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Listing archive: iso9660.iso

--
Path = iso9660.iso
Type = Iso
Created = 2017-06-30 18:31:33
Modified = 2017-06-30 18:31:33

----------
Path = nimbie.jpg
Folder = -
Size = 69424
Packed Size = 69424
Modified = 2017-06-30 13:23:38

Path = readme.txt
Folder = -
Size = 37
Packed Size = 37
Modified = 2017-06-30 13:25:20

UDF Bridge:

7z l -slt iso9660_udf.iso

Result:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Listing archive: iso9660_udf.iso

--
Path = iso9660_udf.iso
Type = Udf
Comment = UDF Bridge demo
Cluster Size = 2048
Created = 2017-06-30 18:31:33

----------
Path = nimbie.jpg
Folder = -
Size = 69424
Packed Size = 69632
Modified = 2017-06-30 13:23:38
Accessed = 2017-06-30 18:31:33

Path = readme.txt
Folder = -
Size = 37
Packed Size = 2048
Modified = 2017-06-30 13:25:20
Accessed = 2017-06-30 18:31:33
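
The two listings differ in "Packed Size". In the UDF listing the values equal the file size rounded up to whole 2048-byte clusters. Whether 7-Zip actually computes the value this way is an assumption; the sketch below merely reproduces the numbers shown above.

```python
CLUSTER_SIZE = 2048  # 'Cluster Size' reported by 7z for the UDF image


def padded_size(size, cluster_size=CLUSTER_SIZE):
    """Round a file size up to the next multiple of the cluster size."""
    clusters = -(-size // cluster_size)  # ceiling division
    return clusters * cluster_size


for size in (69424, 37):
    print(size, padded_size(size))
```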

13/12/2017

Apache Tika vs DROID

https://twitter.com/anjacks0n/status/941020183812100096

Esp.:

Without Tika, relying on DROID, there would have been 25,887,108 unidentified resources - mostly plain text, JS, CSS etc. Without DROID, only 464 would go unidentified, but we'd have no format-version-level information. Combining tools is crucial for web archives.

Find which file(s) are located in damaged area of ISO image

Using iso-info:

iso-info -l -i dvd-erik.iso

Result:

  d [LSN     22]      4096 Jan 01 1970 01:00:00  .
  d [LSN     22]      2048 Jan 01 1970 01:00:00  ..
  - [LSN     26] 158549392 Jul 30 2008 09:33:59  086_10B21_078v_079r.TIF
  - [LSN  77443] 158633884 Jul 30 2008 09:34:08  087_10B21_079v_080r.TIF
  - [LSN 154901] 157658880 Jul 30 2008 09:34:19  088_10B21_080v_081r.TIF
  - [LSN 231883] 157877788 Jul 30 2008 09:34:29  089_10B21_081v_082r.TIF
    ::
    ::
  - [LSN 2092850] 158203324 Jul 30 2008 09:38:31  113_10B21_105v_106r.TIF
  - [LSN 2170098] 156139844 Jul 30 2008 09:38:41  114_10B21_106v_107r.TIF

Here LSN * 2048 = offset of start of file.
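
A minimal sketch of the lookup itself: given the byte offset of a damaged area, find the file whose extent contains it. The entries are copied from the listing above; the function name file_at_offset and the value of bad_offset are made up for illustration.

```python
SECTOR_SIZE = 2048  # bytes per logical sector, as in the note above


def file_at_offset(entries, offset):
    """Return the name of the file that contains byte offset, or None."""
    for lsn, size, name in entries:
        start = lsn * SECTOR_SIZE
        if start <= offset < start + size:
            return name
    return None


# (LSN, size in bytes, name), copied from the iso-info listing above
entries = [
    (26, 158549392, "086_10B21_078v_079r.TIF"),
    (77443, 158633884, "087_10B21_079v_080r.TIF"),
    (154901, 157658880, "088_10B21_080v_081r.TIF"),
]

# Hypothetical offset of an unreadable sector
bad_offset = 160000000
print(file_at_offset(entries, bad_offset))
```

An offset that falls in the padding between two files, or before the first file, gives None.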

11/12/2017

DDrescue --try-again switch

From the manual:

--try-again Mark all non-trimmed and non-scraped blocks inside the rescue domain as non-tried before beginning the rescue. Try this if the drive stops responding and ddrescue immediately starts scraping failed blocks when restarted. If '--retrim' is also specified, mark all failed blocks inside the rescue domain as non-tried.

Useful if ddrescue remains stuck endlessly in "scraping failed blocks".

06/12/2017

Run .msi installer as admin

msiexec /a putty-64bit-0.70-installer.msi

23/11/2017

Useful VeraPDF command-lines

Disable PDF/A validation, only extract features:

verapdf --off --extract whatever.pdf > whatever.xml

Recursively process directory tree:

verapdf --recurse --off --extract myDir > whatever.xml

21/11/2017

Zenodo categories KBNL community

17/11/2017

Archivematica 1.6 Default Format Policy Registry

https://docs.google.com/spreadsheets/d/1g2vbAFBHWhsPRkNljbQBsKasMI-GCFTsQLol0cFT6js/edit#gid=0

Understanding Computer Technology

https://web.archive.org/web/20020201195007/http://www.geocities.com:80/SiliconValley/4031/

16/11/2017

Obtaining a list of all hyperlinks in an MS-Word document

https://superuser.com/questions/670324/obtaining-a-list-of-all-hyperlinks

05/11/2017

Environmental impact of academic conferences

https://www.researchgate.net/publication/318970823_Academic_conferences_urgently_need_environmental_policies

(Note: lots of DOIs in references don't resolve at all, or resolve to wrong location!)

http://onlinelibrary.wiley.com/doi/10.1111/1746-692X.12106/full

http://www.nature.com/news/a-clean-green-science-machine-1.17125?WT.mc_id=TWT_NatureNews

http://tyndall.ac.uk/sites/default/files/twp161.pdf

https://www.chemistryworld.com/opinion/cutting-the-science-travel-footprint/9567.article

01/11/2017

Use of objectCharacteristicsExtension element in PREMIS

Archivematica examples in:

https://www.loc.gov/standards/premis/examples.html

26/10/2017

Customise Pylint error reporting for a project

https://stackoverflow.com/questions/43280486/pylint-error-message-e1101-module-lxml-etree-has-no-strip-tags-member

25/10/2017

File identification: Tika vs DROID

Paper by Andy Jackson (2012):

http://arxiv.org/pdf/1210.1714.pdf

20/10/2017

Extract URLs from PDF

https://twitter.com/andrewjbtw/status/920791293122396160

11/10/2017

Convert compressed TIFF to uncompressed TIFF

03/10/2017

For one file:

convert whatever_compressed.tif +compress whatever_uncompressed.tif

Multiple files:

#!/bin/bash


# Input and output directories
dirIn=~/tiffsDDD
dirOut=~/tiffsDDUncompressed

while IFS= read -d $'\0' -r file ; do
    # File basename 
    bName=$(basename -s .TIF "$file")
    
    # Output name
    outName=$bName.TIF
    
    # Full output paths
    fOut="$dirOut/$outName"
 
    # Convert to uncompressed TIFF
    convert "$file" +compress "$fOut"

done < <(find "$dirIn" -type f -name "*.TIF" -print0)

Linux Mint 18.2 issues

28/09/2017

warcio

This library provides a fast, standalone way to read and write WARC Format commonly used in web archives.

https://github.com/webrecorder/warcio

25/09/2017

JWAT TOOLS

Includes ARC/WARC validation:

https://sbforge.org/display/JWAT/Running+JWAT-Tools

23/09/2017

Format Technology Lifecycle Analysis

https://tspace.library.utoronto.ca/bitstream/1807/75891/1/JASIST-format-technology-lifecycle-analysis.pdf

12/09/2017

Mimetypes of MS Office formats

https://technet.microsoft.com/en-us/library/ee309278(office.12).aspx

08/09/2017

Tika mimetype definitions

https://github.com/apache/tika/tree/master/tika-core/src/main/resources/org/apache/tika/mime

06/09/2017

Kaitai Struct

Kaitai Struct is a declarative language used to describe various binary data structures, laid out in files or in memory (...).

The main idea is that a particular format is described in Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include a generated code for a parser that can read described data structure from a file / stream and give access to it in a nice, easy-to-comprehend API.

http://kaitai.io/

29/08/2017

Suppress 'invalid-name' messages in Pylint output

Use -d option with invalid-name:

python3 -m pylint -d invalid-name boxvalidator.py > pylintjpylyzer.txt

24/08/2017

Zenodo: list all publications with "Digital Preservation" keyword in kbnl community

https://zenodo.org/communities/kbnl/search?page=1&size=20&q=keywords%253A%2522digital%2Bpreservation%2522

16/08/2017

JPEG 2000 drafts and freely available standards

https://github.com/Dzonatas/solution/tree/master/Documentation

15/08/2017

Remember Git login username/password

The following command keeps login credentials in the cache for 1 hour:

git config --global credential.helper "cache --timeout=3600"

14/08/2017

Add path to LD_LIBRARY_PATH

For some reason I always forget this (below for OpenJPEG):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Very large JP2 images

10/08/2017

How to commit to an existing tag in Git

https://gist.github.com/danielestevez/2044589

15/06/2017

How to use HTML and CSS for printing

http://css4.pub/

Prince tool:

http://www.princexml.com/

WeasyPrint (open-source alternative):

http://weasyprint.org/

29/05/2017

E-READ

The goal of this Action is to improve scientific understanding of the implications of digitization, hence helping individuals, disciplines, societies and sectors across Europe to cope optimally with the effects.

http://ereadcost.eu/

12/05/2017

Huge List Of Example Files – Creative Commons

http://blog.online-convert.com/huge-list-of-example-files-creative-commons/

10/05/2017

Copy directory tree with Robocopy

robocopy sourceDir destDir /COPYALL /E /R:0 /DCOPY:T

E.g.:

robocopy H:\iromlabTestKBDepotNew M:\DigitalPreservation\optischeDragers\iromlabTestKBDepot /COPYALL /E /R:0 /DCOPY:T >robocopy.stdout 2>robocopy.stderr

19/04/2017

Reading ISO image of data session of multisession (e.g. enhanced audio) CDs

Some useful links:

Good description of the problem:

https://lists.debian.org/debian-user/2005/01/msg02339.html

the sector numbers in the file system refer to sectors of the original CD rather than sectors of session2.iso. I don't know of a utility for rewriting them so that the file can be loop-mounted or written to an ordinary CD, but you can at least get a directory listing by using isoinfo with an offset:

isoinfo -i session2.iso -N 204345 -l

https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00048.html

Esp.:

Remember, the path table and directory structure of the iso reflect the fact that the ISO filesystem starts on sector 222145 (49:23:70) of the CD. If it is burned to another CD at a different position, it won't work. Likewise, any program that reads the iso will need to be able to compensate for the offset. Try, for example: isoinfo -N 222145 -d -i '8mm-songs_to_love_and_die_by.iso'

Also (from same thread):

https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00053.html
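
The sector number and the minute:second:frame address in the quote above are two notations for the same position. A minimal sketch of the conversion, assuming the usual CD addressing of 75 frames per second with sector 0 located at 00:02:00 (the function name msf_to_sector is made up):

```python
FRAMES_PER_SECOND = 75   # CD frames (sectors) per second
PREGAP_FRAMES = 150      # 2-second offset: address 00:02:00 is sector 0


def msf_to_sector(minutes, seconds, frames):
    """Convert a CD minute:second:frame address to a logical sector number."""
    total = (minutes * 60 + seconds) * FRAMES_PER_SECOND + frames
    return total - PREGAP_FRAMES


print(msf_to_sector(49, 23, 70))
```

For 49:23:70 this gives 222145, the value passed to isoinfo -N in the quoted example.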

06/04/2017

Ensure correct encoding when writing a text file in Python

The default encoding for read/write depends on locale settings, which can result in unexpected behaviour. See e.g.:

http://stackoverflow.com/questions/43256079/decoding-of-bytes-object-results-in-unexpected-invalid-utf-8-how-can-i-avoid

Solution: always set the encoding explicitly when opening a file for read/write in text mode. Example:

# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
# Decode to string 
myString = myBytes.decode('utf-8')

# Write myString to file
with open("myString.txt", "w", encoding="utf-8") as ms_file:
    ms_file.write(myString)

03/04/2017

Create symbolic link on Windows

In this case, create a link to F:\Pandoc\pandoc.exe in directory c:\bin (run the command from within c:\bin):

mklink pandoc.exe F:\Pandoc\pandoc.exe

30/03/2017

How to Create a List of Your Installed Programs on Windows

https://www.howtogeek.com/165293/how-to-get-a-list-of-software-installed-on-your-pc-with-a-single-command/

Powershell method:

Get-ItemProperty HKLM:\Software\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall\* | Select-Object DisplayName, DisplayVersion, Publisher, InstallDate | Format-Table -AutoSize > installedPrograms.txt

28/03/2017

Guidelines for Using PREMIS with METS for exchange

https://www.loc.gov/standards/premis/guidelines2017-premismets.pdf

24/03/2017

Extract text from Epub

Apache Tika

java -jar tika-app-1.14.jar -t whatever.epub > whatever.txt

BUT doesn't return chapters in reading order!!

Textract (Python)

https://github.com/deanmalmgren/textract

Installs with errors under Windows; seems to work OK on Linux.

23/03/2017

Build process for Windows binaries of file/libmagic under Linux

https://github.com/nscaife/file-windows

28/02/2017

Change bit depth of WAV file

Saves output file as 24 bits / channel:

ffmpeg -i frogs-01.wav -codec pcm_s24le frogs-01-24-bit.wav

For list of all codec values:

ffmpeg -codecs

07/02/2017

Python relative imports for the billionth time

http://stackoverflow.com/questions/14132789/relative-imports-for-the-billionth-time

27/01/2017

FFmpeg - Extract Blu-Ray Audio

https://wiki.gentoo.org/wiki/FFmpeg_-_Extract_Blu-Ray_Audio

19/01/2017

Accessing raw devices under Windows, command line

From:

https://support.microsoft.com/nl-nl/help/100027/info-direct-drive-access-under-win32

To open a physical hard drive for direct disk access (raw I/O) in a Win32-based application, use a device name of the form

\\.\PhysicalDriveN

where N is 0, 1, 2, and so forth, representing each of the physical drives in the system.

To open a logical drive, direct access is of the form

\\.\X:

where X: is a hard-drive partition letter, floppy disk drive, or CD-ROM drive.

E.g. compute checksum on CD in d: drive:

 md5sum \\.\D:
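
The same checksum can be computed in Python by reading the device in chunks. This is a sketch only: the function name md5_of_stream is made up, and opening a raw device path this way on Windows is an assumption that depends on sufficient privileges. The chunk size is kept at a multiple of the 2048-byte CD sector size, since raw device reads generally need to be sector-aligned.

```python
import hashlib


def md5_of_stream(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a file or device, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as stream:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Assumed usage on Windows (needs a disc in D: and sufficient privileges):
# print(md5_of_stream(r"\\.\D:"))
```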

Accessing raw devices in Python (under Windows)

Access to logical drives:

http://stackoverflow.com/q/6522644/1209004

Write access:

http://stackoverflow.com/q/7135398/1209004

Reading raw disks with Python:

http://blog.lifeeth.in/2011/03/reading-raw-disks-with-python.html

Isoparser

https://github.com/barneygale/isoparser

15/01/2017

How to host your static site with HTTPS on GitHub Pages and CloudFlare

https://developer.ubuntu.com/en/blog/2016/02/17/how-host-your-static-site-https-github-pages-and-cloudflare/

BUT this will make accessing the site CAPTCHA hell for Tor users: https://support.cloudflare.com/hc/en-us/articles/203306930-Does-CloudFlare-block-Tor-

Alternatives:

  • CERTBot / Letsencrypt: requires server access
  • Github pages has built-in https support, but only for github.io domains.

11/01/2017

How to Host your Python Package on PyPI with GitHub

https://www.codementor.io/arpitbhayani/host-your-python-package-using-github-on-pypi-du107t7ku

Once everything is set up, for each new release the basic steps are:

  1. Update version number in main code
  2. Update link to download_url (in my case this is automated)
  3. Commit changes & push
  4. Add tag: git tag -a x.y.z -m "whatever"
  5. git push --tags
  6. python setup.py register -r pypi
  7. python setup.py sdist upload -r pypi

09/01/2017

CD/DVD Carrier checksums vs ISO image checksums

The md5sum of a "burnt" CD can be different than the md5sum of the associated iso file and not indicate an error

http://twiki.org/cgi-bin/view/Wikilearn/CdromMd5sumsAfterBurning

See also:

http://superuser.com/questions/220082/how-to-validate-a-dvd-against-an-iso

06/01/2017

Books and Literature Status Review 2016

https://warekennis.nl/wp-content/uploads/2013/03/BOOKS-AND-LITERATURE-STATUS-REVIEW-2017-.pdf

02/01/2017

Use ffmpeg / ffprobe to get tech properties from audio file

ffprobe track01.cdda.wav -show_format -show_streams > properties.txt

Result (file properties.txt):

[STREAM]
index=0
codec_name=pcm_s16le
codec_long_name=PCM signed 16-bit little-endian
profile=unknown
codec_type=audio
codec_time_base=1/44100
codec_tag_string=[1][0][0][0]
codec_tag=0x0001
sample_fmt=s16
sample_rate=44100
channels=2
channel_layout=unknown
bits_per_sample=16
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/44100
start_pts=N/A
start_time=N/A
duration_ts=8233176
duration=186.693333
bit_rate=1411200
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
[/STREAM]
[FORMAT]
filename=track01.cdda.wav
nb_streams=1
nb_programs=0
format_name=wav
format_long_name=WAV / WAVE (Waveform Audio)
start_time=N/A
duration=186.693333
size=32932748
bit_rate=1411201
probe_score=99
[/FORMAT]

XML output:

ffprobe track01.cdda.wav -show_format -show_streams -print_format xml > properties.xml
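
A minimal sketch of how the default (non-XML) output could be read into a dictionary and sanity-checked. It assumes a file with a single stream (with several streams, later values would overwrite earlier ones). The function name parse_stream_section is made up, and the sample is an excerpt of the output above.

```python
def parse_stream_section(text):
    """Parse key=value lines between [STREAM] and [/STREAM] into a dict."""
    props = {}
    in_stream = False
    for line in text.splitlines():
        line = line.strip()
        if line == "[STREAM]":
            in_stream = True
        elif line == "[/STREAM]":
            in_stream = False
        elif in_stream and "=" in line:
            key, value = line.split("=", 1)
            props[key] = value
    return props


# Excerpt of the ffprobe output listed above
sample = """[STREAM]
codec_name=pcm_s16le
sample_rate=44100
channels=2
bits_per_sample=16
duration_ts=8233176
bit_rate=1411200
[/STREAM]"""

props = parse_stream_section(sample)
# For uncompressed PCM: bit rate = sample rate x channels x bits per sample
expected_bit_rate = (int(props["sample_rate"]) * int(props["channels"])
                     * int(props["bits_per_sample"]))
# Duration in seconds = number of samples / sample rate
duration = int(props["duration_ts"]) / int(props["sample_rate"])
print(expected_bit_rate == int(props["bit_rate"]), round(duration, 6))
```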

01/01/2017

Update the Fritz!Box Mediaserver file index from a script

https://blog.heckel.xyz/2012/12/07/script-refresh-the-fritzmediaserver-dlna-index-of-the-fritzbox-6360-cable/

Script:

https://blog.heckel.xyz/wp-content/uploads/2012/12/fritzbox-dlna-refresh

19/12/2016

AMIA open workflows and resources for A/V archiving

https://github.com/amiaopensource/open-workflows

16/12/2016

NYPL Specifications for Audio and Moving Image Digitization

https://confluence.nypl.org/display/DIG/Specifications+for+Audio+and+Moving+Image+Digitization

07/12/2016

Mediags

Mediags is a console program that scans directories for media files and verifies the integrity of those files. Detailed content reports may optionally be produced.

https://mediags.codeplex.com/

(Binaries windows only)

01/12/2016

Browsers, not apps, are the future of mobile:

https://medium.com/swlh/browsers-not-apps-are-the-future-of-mobile-c552752ff75#.ilc1zlj1a

27/11/2016

Appear.in

Video conversations with up to 8 people for free. No login required — no installs

https://appear.in/

31/10/2016

A guide to Wikidata, SPARQL, and WDQS

https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/A_Guide_To_WDQS

28/10/2016

PDFx

Extract references and metadata from PDF documents, and download all referenced PDFs:

https://www.metachris.com/pdfx/

24/10/2016

Explanation of need for Multi Threading GUI programming

http://stackoverflow.com/questions/13343096/explanation-of-need-for-multi-threading-gui-programming

22/10/2016

Digital Open Access Identifier

http://doai.io/

19/10/2016

Wikidata:WikiProject Informatics/File formats

https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/File_formats

14/10/2016

Python debugging tips

http://stackoverflow.com/questions/1623039/python-debugging-tips

30/09/2016

A Slow-Motion Revolution (history of the CD-ROM)

http://www.filfre.net/2016/09/a-slow-motion-revolution/

29/09/2016

An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter

http://journal.code4lib.org/articles/11358

25/09/2016

Check the Accessibility of a PDF Document (online)

http://checkers.eiii.eu/en/pdfcheck/

23/09/2016

Python event scheduler and queue modules

https://docs.python.org/3.6/library/queue.html

https://docs.python.org/3.6/library/sched.html

And perhaps:

https://docs.python.org/3.6/library/threading.html#module-threading

Possibly usable in CD imaging workflow (esp. interaction with operator input).
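
A rough sketch of how queue and threading might fit together in such a workflow: a worker thread handles discs in the background while the main thread stays free to accept operator input. The disc identifiers and the "imaged" placeholder are hypothetical; nothing here reflects an actual imaging workflow.

```python
import queue
import threading


def worker(jobs, results):
    """Process disc identifiers from the queue until a None sentinel arrives."""
    while True:
        disc_id = jobs.get()
        if disc_id is None:
            break
        # Placeholder for the actual imaging step
        results.append("imaged " + disc_id)


jobs = queue.Queue()
results = []
thread = threading.Thread(target=worker, args=(jobs, results))
thread.start()

# In a real workflow these would be entered by the operator
for disc_id in ["disc-001", "disc-002"]:
    jobs.put(disc_id)

jobs.put(None)  # tell the worker to stop
thread.join()
print(results)
```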

13/09/2016

media-autobuild_suite

This Windows Batchscript setups a MinGW/GCC compiler environment for building ffmpeg and other media tools under Windows. After building the environment it retrieves and compiles all tools. All tools get static compiled, no external .dlls needed (with some optional exceptions)

https://github.com/jb-alvarado/media-autobuild_suite

By default this doesn't build the optional ffmpeg libraries (incl. libcdio). To build them, select option 4 (All available external libs) when the batch file prompts you to Choose ffmpeg and mpv optional libraries?. Alternatively (if you accidentally ran the build with the default option), open file media-autobuild_suite.ini and set the value of ffmpegChoice to 4:

ffmpegChoice=4

Libcdio windows binaries

http://lrn.no-ip.info/packages/i686-w64-mingw/libcdio/0.93-1/

12/09/2016

Cdrdao Windows binaries

http://www.student.tugraz.at/thomas.plank/

08/09/2016

Discid tool

http://discid.sourceforge.net/

Tried flactag fork, which gives following output for CD-ROM:

Query failed: no actual audio tracks on disc: CDROM or DVD?

So it might be useful for distinguishing between audio CDs and CD-ROMs (the tarball contains a Windows binary).

disktype tool

http://disktype.sourceforge.net/

Output audio CD:

Block device, size 690.4 MiB (723972096 bytes)
CD-ROM, 14 tracks, CDDB disk ID D912690E
Track 1: Audio track, 37.35 MiB (39163152 bytes),   3 min 42 sec
Track 2: Audio track, 87.89 MiB (92163120 bytes),   8 min 42 sec 
::
Track 13: Audio track, 37.22 MiB (39029088 bytes),   3 min 41 sec
Track 14: Audio track, 78.14 MiB (81931920 bytes),   7 min 44 sec

CD-ROM:

Block device, size 223.2 MiB (233990144 bytes)
CD-ROM, 1 track, CDDB disk ID 0205F301
Track 1: Data track, 223.2 MiB (233994240 bytes)
  ISO9660 file system
    Volume name "0305132335"
    Preparer    "CEQUADRAT 32BIT ISO-9660 FORMATTER COPYRIGHT (C) 1995-1998 BY CEQUDRAT GMBH"
    Data size 222.9 MiB (233682944 bytes, 114103 blocks of 2 KiB)
    Joliet extension, volume name "0305132335"

Enhanced audio CD:

Block device, size 223.2 MiB (233990144 bytes)
CD-ROM, 22 tracks, CDDB disk ID 4B113416
Track 1: Audio track, 9.627 MiB (10094784 bytes),   0 min 57 sec
Track 2: Audio track, 30.01 MiB (31462704 bytes),   2 min 58 sec
::
Track 20: Audio track, 41.33 MiB (43340304 bytes),   4 min 05 sec
Track 21: Audio track, 47.73 MiB (50048208 bytes),   4 min 43 sec
Track 22: Data track, 90.84 MiB (95252480 bytes)

DVD:

Block device, size 223.2 MiB (233990144 bytes)
CD-ROM, 1 track, CDDB disk ID 023BFD01
Track 1: Data track, 2.197 GiB (2358986752 bytes)
  Apple partition map, 2 entries
  Partition 1: 31.50 KiB (32256 bytes, 63 sectors from 1)
    Type "Apple_partition_map"
  Partition 2: 2.737 GiB (2938324992 bytes, 5738916 sectors from 1108)
    Type "Apple_HFS"
    HFS Plus file system
      Volume size 2.737 GiB (2938324992 bytes, 1434729 blocks of 2 KiB)
      Volume name "BelPop Marc Moulin"
  UDF file system
    Sector size 2048 bytes
    Volume name "BelPop Marc Moulin"
    UDF version 1.50
  ISO9660 file system
    Volume name "BELPOPMARCMOULIN"
    Data size 2.737 GiB (2938894336 bytes, 1435007 blocks of 2 KiB)
    Joliet extension, volume name "BelPop Marc Moul"

(Note that the DVD is identified as a CD-ROM; this doesn't really matter, as extraction from a DVD is identical to extraction from a data CD-ROM.)
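
The track lines in the disktype output above could be used to tell carrier types apart. A rough sketch, assuming the output always uses the "Track N: Audio track" / "Track N: Data track" wording shown in these examples. The function name and the category labels are my own, and this cannot tell an enhanced audio CD from a mixed-mode CD, since the output shown has no session information.

```python
import re

# Matches track lines such as "Track 1: Audio track, 37.35 MiB (...)".
TRACK_RE = re.compile(r"^Track \d+: (Audio|Data) track")


def classify_disc(disktype_output):
    """Return 'audio', 'data', 'mixed' or 'unknown' for a disktype report."""
    kinds = set()
    for line in disktype_output.splitlines():
        match = TRACK_RE.match(line.strip())
        if match:
            kinds.add(match.group(1))
    if kinds == {"Audio"}:
        return "audio"
    if kinds == {"Data"}:
        return "data"
    if kinds == {"Audio", "Data"}:
        return "mixed"
    return "unknown"


sample = """CD-ROM, 22 tracks, CDDB disk ID 4B113416
Track 21: Audio track, 47.73 MiB (50048208 bytes),   4 min 43 sec
Track 22: Data track, 90.84 MiB (95252480 bytes)"""
print(classify_disc(sample))
```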

Compiles without problems under Windows (using Cygwin), but doesn't seem to be able to access cd-devices. E.g.:

disktype /dev/sr0

Result:

--- /dev/sr0
Block device, size 332.6 MiB (348790784 bytes)
disktype: Data read failed at position 0: Invalid request code

Or:

disktype D:\

Result:

--- D:\
disktype: D:\: Is a directory

Or:

disktype D

Result:

--- D
disktype: Can't stat D: No such file or directory

Perhaps try cdrdao scanbus?

07/09/2016

WMI queries from the command line (Windows)

http://www.robvanderwoude.com/wmic.php

Example - get information about optical drives:

wmic cdrom  where mediatype!='unknown' get > test.txt

06/09/2016

Libcdio & pycdio

The GNU Compact Disc Input and Control library (libcdio) contains a library for CD-ROM and CD image access. Applications wishing to be oblivious of the OS- and device-dependent properties of a CD-ROM or of the specific details of various CD-image formats may benefit from using this library.

http://www.gnu.org/software/libcdio/

Python interface:

https://pypi.python.org/pypi/pycdio/

01/09/2016

Python data entry form example

http://codereview.stackexchange.com/questions/52397/a-general-purpose-gui-data-input-with-validation-but-unclear-about-best-object

31/08/2016

Imaging and image format for mixed mode CDs

Brown, "Developing Virtual CD-ROM Collections" (2012):

http://www.ijdc.net/index.php/ijdc/article/view/216/285

Page 13:

  • Create BIN/TOC file with cdrdao using:

    cdrdao read-cd --read-raw --device 1,0,0 --datafile allmy.bin allmy.toc

  • Author developed SheepShaver extension that allows these images to be read by emulator

Caveats:

  • The given cdrdao command only extracts one session (I guess the Voyager CD-ROMs only contain one session with both the data and audio tracks, although the paper isn't entirely clear about this).
  • In case of a CD with multiple sessions one would have to repeat the command for each of those (result: one separate image for each session)
  • Hybrid CD-ROMs are not supported by any of the most widely-used emulators (also stressed by author)

Jackson (BL):

http://anjackson.net/keeping-codes/practice/developing-a-robust-migration-workflow-for-preserving-and-curating-handheld-media.html

On multisession carriers:

While CD-ROM, DVD and HFS+ format disks are reasonably well covered by this approach, there are some important limitations. For example, the optical media formats all support the notion of ‘sessions’ – consecutive additions of tracks to a disk. This means that a given carrier may contain a ‘history’ of different versions of the data. By choosing to extract a single disk image, we only expose the final version of the data track, and any earlier versions, sessions or tracks are ignored. For our purposes, these sessions are not significant, but this may not be true elsewhere.

BUT sessions (at least on commercially manufactured carriers) typically don't contain different versions of the same data, but data that are completely different! Example: many 'enhanced' audio CDs that contain one session with all audio tracks, and another session with a data track. So sessions are significant!

BL workflow for Red Book (audio) and Yellow Book (mixed mode) carriers:

  • Image to MDS/MDF format
  • Then post-process MDS/MDF file with IsoBuster

But it's not entirely clear whether the MDS/MDF format can handle multisession carriers.

I found this in the Knowledge Base of the developer of the format:

http://support.alcohol-soft.com/en/knowledgebase.php?postid=15034&title=Restrictions+for+creating+image+files

Image making wizard will always allow the user to create mds/mdf ccd/img/sub.

But ISO format, only for those disc's that contain 1 data track(mode1 or mode2form1).

For cue/bin only for one session disc. if the original disc is a multi-session one, then the cue/bin would not be available and If the user chooses read sub-channel, the cue/bin and iso would be unavailable as well . because iso and cue/bin could not save sub channel data.

So apparently MDS/MDF does support multisession after all!

Good overview of disc image formats here:

http://www.theisozone.com/blogs/homebrew/burning-image-file-type-explained/

23/08/2016

Sheepshaver (Macintosh emulator)

Includes links to ROM and startup images:

http://www.redundantrobot.com/#/sheepshaver

Preserving and Emulating Digital Art Objects

Report by Cornell University:

https://ecommons.cornell.edu/handle/1813/41368

CD-ROM FAQ

Some useful info on Mac / PC images and hybrids:

http://www.macdisk.com/faqcden.php

22/08/2016

CDRWIN manual

Contains lots of info on optical carrier and disc image formats (e.g. BIN/CUE):

http://web.archive.org/web/20070221154246/http://www.goldenhawk.com/download/cdrwin.pdf

18/08/2016

Python requests fetch a file from a local url

http://stackoverflow.com/questions/10123929/python-requests-fetch-a-file-from-a-local-url

17/08/2016

Computer Display Calibration 101

https://blog.codinghorror.com/computer-display-calibration-101/

Bias Lighting

https://blog.codinghorror.com/bias-lighting/

16/08/2016

Recursively find/count files with specific extension

Find all files with .pdf extension:

find . -type f -name '*.pdf'

Count all files with .pdf extension:

find . -type f -name '*.pdf'| wc -l
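
The same in Python, as a small sketch using pathlib (the function name is my own). Like find, it only counts regular files; matching is case-sensitive on Linux.

```python
from pathlib import Path


def find_by_extension(root, extension):
    """Return a sorted list of regular files under root ending in extension."""
    return sorted(p for p in Path(root).rglob(f"*{extension}") if p.is_file())


# Usage: len(find_by_extension(".", ".pdf")) gives the count.
```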

PyRomInfo

Esp. 'useful links' section:

https://github.com/garbear/pyrominfo

21/07/2016

One pixel is worth three thousand words

Representation of 1 pixel in many different formats:

http://cloudinary.com/blog/one_pixel_is_worth_three_thousand_words

20/07/2016

The Programming Historian

Online tutorials on APIs, Data Management, Data Manipulation, Distant Reading, Linked Open Data, Mapping and GIS, Network Analysis, Omeka Exhibit Building, Web Scraping and Programming with Python:

http://programminghistorian.org/lessons/

14/07/2016

Writerperfect library

Supports lots of (old) Office-related formats + includes many conversion tools:

https://launchpad.net/ubuntu/+source/writerperfect/0.9.5-1

06/07/2016

Horrifying PDF experiments

https://github.com/osnr/horrifying-pdf-experiments

05/07/2016

Python classes simple examples

https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial/Classes

26/06/2016

How To Install Linux Mint to SSD and HHD /home

https://forums.linuxmint.com/viewtopic.php?t=177915

23/06/2016

Python metadata libraries

(Source: Nick Krabbenhöft on Twitter)

22/06/2016

Library of Congress Audio Compact Disc METS Profile

http://www.loc.gov/standards/mets/profiles/00000007.html

Creating Virtual CD-ROM Collections

http://dx.doi.org/10.2218/ijdc.v4i2.107

From Imaging to Access - Effective Preservation of Legacy Removable Media

http://www.digpres.com/publications/woodsbrownarch09.pdf

Example METS file (note that apparently they combine multiple ISOs in one AIP):

http://webapp1.dlib.indiana.edu/virtual_disk_library/index.cgi/4252478/mets

BL METS profile - Sound Recordings 2

http://www.bl.uk/profiles/sound/METS_profile.pdf

19/06/2016

Linux File System Hierarchy

https://www.blackmoreops.com/2015/06/18/linux-file-system-hierarchy-v2-0/

Digital Dark Age Klaxon

https://youtu.be/a_6CZ2JaEuc

17/06/2016

SIP creator tools

14/06/2016

Characterisation of CD-ROMs

31/05/2016

Validate XML against user-defined XSD schema

xmllint --noout --schema schema.xsd whatever.xml

27/05/2016

Recursively compute md5 checksums for all files in directory tree

find -type f -exec md5sum "{}" + > checksums.md5

Source: http://askubuntu.com/a/318534. Works also under Cygwin.

Issue: the output also includes the MD5 sum of the output file itself (which becomes invalid once anything is written to the file).
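
One way around this issue is to skip the output file explicitly. A sketch in Python, assuming the checksum file lives in the root of the tree; the function names and the default file name are illustrative. Lines follow the usual md5sum layout (hash, two spaces, relative path).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(root, out_name="checksums.md5"):
    """Write checksums for all files under root, excluding the output file."""
    root = Path(root)
    out_path = root / out_name
    # Exclude the output file so its own (unstable) hash is never recorded.
    files = sorted(p for p in root.rglob("*") if p.is_file() and p != out_path)
    lines = [f"{md5_of_file(p)}  ./{p.relative_to(root).as_posix()}" for p in files]
    out_path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```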

23/05/2016

Generate new access JP2 from master

  1. Convert master JP2 to TIFF using Kakadu (this preserves any embedded ICC profiles):
    kdu_expand -i master.jp2 -o master.tiff

  2. Convert TIFF to lossy JP2 with Aware via jpwrappa:
    jpwrappa -m -p C:\jpwrappa\profiles\optionsKBAccessLossy_2014.xml master.tiff access.jp2

(The -m switch can be omitted, in which case there is no need for Exiftool.)

19/05/2016

Disc robots

18/05/2016

Digital newspapers

10/05/2016

Use docx document as template in Pandoc

Use the --reference-docx switch:

pandoc -S --reference-docx=template.docx test.md -o test.docx 

26/04/2016

Rollback git repo to previous state + push changes to remote

Rollback to previous state:

git reset --hard <tag/branch/commit id>

Force-push the changes to the remote:

git push ... -f

Example:

git reset --hard 2dbe067c1674dcf9a23104c4b64b772e1550ba29
git push origin master -f

Mimetype Comparison DROID, Tika, File, April 2016

http://162.242.228.174/mimes/mime_comparisons.html

Common Crawl

An open repository of web crawl data that can be accessed and analyzed by anyone

https://commoncrawl.org/

25/04/2016

Tika-python

A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

https://github.com/chrismattmann/tika-python

Manipulating PDFs with Python

https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

22/04/2016

Introduction to the Bash Command Line (The Programming Historian)

http://programminghistorian.org/lessons/intro-to-bash

20/04/2016

Python wrapper for EpubCheck

https://github.com/titusz/epubcheck

Tegelspreukmaker

http://www.tegelspreukmaker.nl/

14/04/2016

Sozi presentation software

Looks a bit similar to Prezi, but open source (presentation as SVG):

http://sozi.baierouge.fr/

06/04/2016

Android screen rotate in VirtualBox

Press F9, F10, F11 or F12 twice. "Auto-rotate screen" option in Android Settings must be enabled.

04/04/2016

HTML codeblock hell in Wordpress

Following codeblock is not rendered correctly in Wordpress:

<pre><code>&lt;div&gt;test&lt;/div&gt;</code></pre>

Workaround is to replace forward slash in closing tag by entity reference:

<pre><code>&lt;div&gt;test&lt;&#47;div&gt;</code></pre>

29/03/2016

Caradoc - a PDF parser and validator

https://github.com/ANSSI-FR/caradoc

Note: current Debian package of Opam not recent enough, so used the instructions under "Binary distribution" at https://opam.ocaml.org/doc/Install.html. Installs binary in /usr/local/bin.

Make file initially didn't work because ocamlfind could not be found. Fixed by typing:

eval $(opam config env)

After this it compiles without any errors.

24/03/2016

Seeing the Double Rainbow: The Trials and Tribulations Working with Optical Media

Includes MiniDisc:

http://ndsr.nycdigital.org/seeing-the-double-rainbow-the-trials-and-tribulations-working-with-optical-media/

15/03/2016

Ebooklib

Python library that reads/writes EPUB, including EPUB 3:

https://github.com/aerkalov/ebooklib

Example, create EPUB from HTML:

https://gist.github.com/bitsgalore/4c830a301f33f584c041

CB infographics on e-books in the Netherlands

http://www.cb.nl/nieuws/alle-relevante-data-over-e-books-in-nederland/

http://www.cb.nl/nieuws/e-bookbarometeblijft-groeien/

14/03/2016

Encyclopedia of Graphics File Formats

http://fileformats.archiveteam.org/wiki/Encyclopedia_of_Graphics_File_Formats

HTML5 is the New Flash

http://homepages.cwi.nl/~steven/Talks/2015/11-06-xml-amsterdam/

05/03/2016

Excel to XML: How to Transfer Your Spreadsheet Data Onto an XML File

This works (but what's referred to as a "schema" isn't really a schema at all):

https://blog.udemy.com/excel-to-xml/

How To Export an Excel 2010 Worksheet to XML

Similar to above, but uses XSD Schema directly, might be better:

https://bitwizards.com/blog/november-2010/how-to-export-an-excel-2010-worksheet-to-xml

23/02/2016

Reference rot in scholarly articles

Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot

Playback WARC

Web archive player:

https://github.com/ikreymer/webarchiveplayer

20/02/2016

Search and replace string for all files in directory tree

E.g. replace every occurrence of /tmp/"$fileIn" with /tmp/"$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 16)":

find /home/johan/cajascripts -type f -print0 | xargs -0 sed -i 's/\/tmp\/"$fileIn"/\/tmp\/"$(cat \/dev\/urandom | tr -cd '\''a-f0-9'\'' | head -c 16)"/g'
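
Getting the escaping right in sed is fiddly, so here is a sketch of the same literal search-and-replace in Python (function name is my own). It works on raw bytes, so no escaping of slashes, quotes or dollar signs is needed and line endings are left untouched. Like the sed command, it touches every file in the tree, including binary ones.

```python
from pathlib import Path


def replace_in_tree(root, old, new, encoding="utf-8"):
    """Replace each literal occurrence of old with new in all files under root.

    Returns the list of files that were changed.
    """
    old_bytes = old.encode(encoding)
    new_bytes = new.encode(encoding)
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if old_bytes in data:
            path.write_bytes(data.replace(old_bytes, new_bytes))
            changed.append(path)
    return changed
```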

18/02/2016

Save blog with archiveBot

  • Don't save offsite links
  • Use 'blogs' ignore pattern

Command (I think?):

!archive http://www.flipvandyke.nl/ --no-offsite-links --ignore-sets=blogs

28/01/2016

Recovering data from broken disk under Ubuntu

https://help.ubuntu.com/community/DataRecovery

14/01/2016

Links to freely available EPUB files with DRM

07/01/2016

Determine actual compression ratio of each quality layer in JP2

If N = number of layers, then first extract layers i and below to a separate JP2 with Aware j2kdriver tool:

j2kdriver -i foo.jp2 -ql (N-i+1) -t JP2 -o foo_i.jp2

Then use jpylyzer to compute the compression ratio of resulting image.

Example - input image with 11 quality layers

Create derived image for each quality layer:

j2kdriver -i MMAD01_000001001_00011_master.jp2 -ql 11 -t JP2 -o layer1.jp2
j2kdriver -i MMAD01_000001001_00011_master.jp2 -ql 10 -t JP2 -o layer2.jp2
::
::
j2kdriver -i MMAD01_000001001_00011_master.jp2 -ql 1 -t JP2 -o layer11.jp2
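
The list of commands above follows directly from the N-i+1 rule, so it can be generated. A small sketch (the function name is my own; it only uses the j2kdriver switches shown above and builds the argument lists without running anything):

```python
def layer_commands(master, n_layers):
    """Return one j2kdriver argument list per derived image (layer1 .. layerN)."""
    commands = []
    for i in range(1, n_layers + 1):
        ql = n_layers - i + 1  # value passed to -ql for derived image i
        commands.append(
            ["j2kdriver", "-i", master, "-ql", str(ql), "-t", "JP2", "-o", f"layer{i}.jp2"]
        )
    return commands


for command in layer_commands("MMAD01_000001001_00011_master.jp2", 11):
    print(" ".join(command))
```

Each argument list could be passed to subprocess.run, after which jpylyzer can be run on the resulting files.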

09/12/2015

Change last modified date of file

touch -d "1 January 1768" myfile.txt

30/11/2015

Stop laptop from re-booting after shutdown

This happened to my HP ProBook 640 G1. Workaround: in BIOS, disable "wake on LAN". Source: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1470723/comments/13

24/11/2015

Comparison of CD rippers

http://wiki.hydrogenaud.io/index.php?title=Comparison_of_CD_rippers

10/11/2015

Convert Word document to PDF from command line

http://superuser.com/questions/789968/windows-7-batch-command-line-to-save-as-pdf-file-for-word-2013-docx-file

06/11/2015

Beeld & Geluid Preservation Metadata Dictionary

http://publications.beeldengeluid.nl/pub/84

05/11/2015

Yale Library Digital Preservation System Requirements

http://web.library.yale.edu/sites/default/files/files/YULDPSHighLevelRequirementsUseCasesDiagrams.pdf

19/10/2015

Best Way To Merge A (GitHub) Pull Request

http://blog.differential.com/best-way-to-merge-a-github-pull-request/

Use the third option (catch the feature branch up with master by rebasing, then do a fast-forward merge).

16/10/2015

Handboek informaticavaardigheden UvA

http://liv.science.uva.nl/index.html

Parts of this may be (re)usable for internal courses etc.

10/10/2015

Add right-click context menu items in Ubuntu / Linux Mint

Ubuntu with Nautilus file manager - Nautilus Actions:

http://www.pcsteps.com/4434-add-right-click-commands-linux-mint-ubuntu/

Linux Mint Cinnamon with Nemo file manager:

http://www.pcsteps.com/4434-add-right-click-commands-linux-mint-ubuntu/

Linux Mint Mate with Caja file manager:

http://www.ethanjoachimeldridge.info/tech-blog/caja-exifstrip-context-action

10/09/2015

Create floppy image from arbitrary files

From http://stackoverflow.com/a/11202773:

Suppose I want to create a floppy image containing file oakcdrom.sys:

dd bs=512 count=2880 if=/dev/zero of=oakcd.img
mkfs.msdos oakcd.img
mcopy -i oakcd.img oakcdrom.sys ::/

Inspect contents:

mdir -i oakcd.img
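
For reference, the dd step above just writes 512 × 2880 = 1474560 zero bytes (a 1.44 MB floppy). A Python sketch of that step only (function name is my own); formatting and copying files still need mkfs.msdos and mcopy.

```python
SECTOR_SIZE = 512     # bytes per sector (dd bs=512)
SECTOR_COUNT = 2880   # sectors on a 1.44 MB floppy (dd count=2880)


def create_blank_image(path):
    """Write a zero-filled floppy-sized image; return its size in bytes."""
    size = SECTOR_SIZE * SECTOR_COUNT
    with open(path, "wb") as handle:
        handle.write(bytes(size))
    return size
```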

27/08/2015

Create image of 3.5" DOS / Windows floppy

General command:

ddrescue -d -n -b 512 /dev/fd0 myfloppy.img myfloppy.log 

To get name of device:

lsblk

Result:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 465,8G  0 disk 
├─sda1   8:1    0 457,9G  0 part /
├─sda2   8:2    0     1K  0 part 
└─sda5   8:5    0   7,9G  0 part [SWAP]
sdb      8:16   0  29,8G  0 disk 
sdc      8:32   1   1,4M  1 disk

So in this case it is /dev/sdc. Create the image with:

sudo ddrescue -d -n -b 512 /dev/sdc myfloppy.img myfloppy.log

Optionally use the dosfsck tool to check the integrity of the file system (assuming it is a DOS file system). Use the following command:

echo "n" |dosfsck -t -r myfloppy.img

The -t option checks for bad clusters, but this only works in combination with -a (automatically repair) or -r (interactively repair). So to do the check without automatic repair or input from user we use -r and then use a pipe to prevent any changes being made. Result:

fsck.fat 3.0.26 (2014-03-07)
Cluster 2845 is unreadable.
Cluster 2846 is unreadable.
Cluster 2847 is unreadable.
Cluster 2848 is unreadable.
Perform changes ? (y/n) myfloppy.img: 33 files, 2304/2847 clusters

Git as synchronisation tool links

Check integrity of git repo:

http://stackoverflow.com/questions/5585388/which-git-commands-perform-integrity-checks

(Bottom line: use git fsck.)

How to shrink the git folder:

http://stackoverflow.com/questions/5613345/how-to-shrink-the-git-folder

25/05/2015

Exiting and re-entering GUI in Linux Mint

Exit GUI:

 Ctrl-Alt-F1

Re-enter:

Ctrl-Alt-F8

18/05/2015

Make Markdown preview in ReText work

From https://bugs.launchpad.net/ubuntu/+source/retext/+bug/1451125:

sudo apt-get install python3-docutils python3-markdown

17/05/2015

Entering BIOS of HP EliteBook 840

From the manual:

  1. Turn on or restart the computer, and then press esc while the “Press the ESC key for Startup Menu” message is displayed at the bottom of the screen
  2. Press f10 to enter Computer Setup.

Check hard disk for bad sectors/blocks

sudo badblocks -sv /dev/sda1

See also: http://askubuntu.com/questions/59064/how-to-run-a-checkdisk

14/04/2015

Location of Virtual Box Guest additions on Linux host machine

/usr/share/virtualbox

23/03/2015

How to get rid of clock skew errors while building packages on VM

Run this on host machine:

sudo ntpdate ntp.xs4all.nl

Then re-start VM; host and guest are now in sync and no more clock skew errors.

17/03/2015

Markdown to HTML (with smart quotes) in Pandoc

pandoc -S whatever.md -o whatever.html

12/03/2015

Validating code lists with Schematron

http://broadcast.oreilly.com/2008/11/validating-code-lists-with-sch.html

02/03/2015

Character sets

Useful Unicode and UTF-8 background info:

http://codesnippets.wpakb.kb.nl/index.php?title=Character_sets

17/02/2015

EPUB creation tool

Sigil:

https://github.com/user-none/Sigil

Simple, user-friendly.

04/02/2015

ISO Image creation

ddrescue:

http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

Command line (Cygwin):

ddrescue -b 2048 -v /dev/scd0 test.iso test.log

Info on image

disktype tool:

http://disktype.sourceforge.net/

E.g. reveals the file system type (ISO/UDF) and other technical info.

22/01/2015

Installing Windows 98 in VirtualBox

General instructions here:

http://www.msfn.org/board/topic/170785-virtualbox-windows-98se-step-by-step/

But results in error:

HID failed to attach mouse driver (VERR_PDM_NO_ATTACHED_DRIVER

Tried this:

https://forums.virtualbox.org/viewtopic.php?f=2&t=58657#p272752

VBoxInternal/USB/HidMouse/1/Config/CoordShift 0

Still doesn't work; neither does:

VBoxInternal/USB/HidMouse/1/Config/CoordShift 1

But see:

https://www.virtualbox.org/manual/ch12.html#idp60139152

Installing Windows 2000 in VirtualBox

Windows 2000 installation failures:

https://www.virtualbox.org/manual/ch12.html#idp60119680

Works!

Then go install guest additions:

https://docs.oracle.com/cd/E36500_01/E36502/html/qs-guest-additions.html

09/12/2014

AsciiMath

"AsciiMath is an easy-to-write markup language for mathematics":

http://asciimath.org/

03/12/2014

Git cheat sheet

Add all files in directory tree to the index (and remove deleted ones)

git add -A

Commit

git commit -m "Changed everything"

Push to master

git push origin master

Push to some other repo (provided I have the rights for this)

git push [email protected]:openplanets/jpylyzer-test-files.git master

Versioning / tagging

Versioning: x.y.z

  • x: API breakage
  • y: new feature
  • z: bugfix
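
A small sketch of how the x.y.z scheme translates into the next version number. The function name and change labels are my own, and resetting the lower components to zero is the usual semantic-versioning convention, which I'm assuming here rather than something stated above.

```python
def bump_version(version, change):
    """Return the next x.y.z version for a given kind of change."""
    x, y, z = (int(part) for part in version.split("."))
    if change == "api-breakage":
        return f"{x + 1}.0.0"       # x: API breakage
    if change == "feature":
        return f"{x}.{y + 1}.0"     # y: new feature
    if change == "bugfix":
        return f"{x}.{y}.{z + 1}"   # z: bugfix
    raise ValueError(f"unknown change type: {change}")


print(bump_version("1.0.3", "feature"))
```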

Add tag

git tag -a 1.1.0 -m "tagging version 1.1.0 with refactored code"

Push tags

git push --tags

02/12/2014

Create test dataset according to new KB digitisation specs from old JP2 batch

1. Convert all master JP2s to TIFF with ImageMagick, using the command:

mogrify -format tiff *.jp2

2. Conversion loses resolution info (see below), so add new values using:

exiftool *.tiff -xresolution=300 -yresolution=300 -resolutionunit=inches

3. Convert TIFFs to master JP2s:

f:\johan\pythoncode\jpwrappa\jpwrappa\jpwrappa.py M:\Trans\johan\testJP2ContrApp2014\B5\tiff\*.tiff M:\Trans\johan\testJP2ContrApp2014\B5\jp2k\master\ -p F:\johan\pythonCode\jpwrappa\jpwrappa\profiles\optionsKBMasterLossless_2014.xml -m

4. Same for access JP2s:

f:\johan\pythoncode\jpwrappa\jpwrappa\jpwrappa.py M:\Trans\johan\testJP2ContrApp2014\B5\tiff\*.tiff M:\Trans\johan\testJP2ContrApp2014\B5\jp2k\access\ -p F:\johan\pythonCode\jpwrappa\jpwrappa\profiles\optionsKBAccessLossy_2014.xml -m

But ... looking at image header box:

<imageHeaderBox>
  <height>2818</height>
  <width>1913</width>
  <nC>1</nC>
  <bPCSign>unsigned</bPCSign>
  <bPCDepth>8</bPCDepth>
  <c>jpeg2000</c>
  <unkC>yes</unkC>
  <iPR>no</iPR>
</imageHeaderBox>

So "unknown colourspace" is set to "yes", which should be no (and it is "No" in the source JP2). So what is causing this? Bug in Aware software? Does this only happen with Grayscale images?

Aware codec produces JP2s that are not valid if TIFF doesn't contain resolution info

To reproduce the problem:

  • Convert any JP2 to TIFF with ImageMagick (will strip away any resolution info)
  • Convert TIFF to JP2 with Aware.

Run jpylyzer on resulting JP2:

<isValidJP2>False</isValidJP2>
<tests>
  <jp2HeaderBox>
    <resolutionBox>
      <captureResolutionBox>
        <hRcNIsValid>False</hRcNIsValid>
      </captureResolutionBox>
    </resolutionBox>
  </jp2HeaderBox>
</tests>

Looking at properties of resolution box:

<resolutionBox>
  <captureResolutionBox>
    <vRcN>29491</vRcN>
    <vRcD>7491</vRcD>
    <hRcN>0</hRcN>
    <hRcD>1</hRcD>
    <vRcE>1</vRcE>
    <hRcE>4</hRcE>
    <vRescInPixelsPerMeter>39.37</vRescInPixelsPerMeter>
    <hRescInPixelsPerMeter>0.0</hRescInPixelsPerMeter>
    <vRescInPixelsPerInch>1.0</vRescInPixelsPerInch>
    <hRescInPixelsPerInch>0.0</hRescInPixelsPerInch>
  </captureResolutionBox>
</resolutionBox>
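
To see where these numbers come from: the resolution in pixels per metre is numerator / denominator × 10^exponent, and multiplying by 0.0254 gives pixels per inch. A quick check in Python (function name is my own) reproduces the reported values: the vertical resolution works out at 1 ppi, and the horizontal numerator of 0 gives a resolution of 0, which is why the hRcN test fails.

```python
def resolution(numerator, denominator, exponent):
    """Return (pixels per metre, pixels per inch) for a JP2 resolution box."""
    per_metre = numerator / denominator * 10 ** exponent
    per_inch = per_metre * 0.0254  # 1 inch = 0.0254 m
    return per_metre, per_inch


v_per_metre, v_per_inch = resolution(29491, 7491, 1)  # vRcN, vRcD, vRcE
h_per_metre, h_per_inch = resolution(0, 1, 4)         # hRcN, hRcD, hRcE
print(round(v_per_metre, 2), round(v_per_inch, 2))    # 39.37 1.0
print(h_per_metre, h_per_inch)                        # 0.0 0.0
```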

25/11/2014

Encodings and writing to file (Unicode)

Here for UTF-8:

http://stackoverflow.com/a/9822937

20/11/2014

Jpylyzer Ubuntu / Debian links

Clone specific branch of Github repo

git clone https://github.com/openpreserve/jpylyzer.git --branch gh-pages --single-branch ./jpylyzerHomepage

07/11/2014

Refs to external macros in Excel workbook

File:

E:\laPeyneCDROM\xlsfiles\series98.xls

Refs to MACROS.XLS'!ENash, which is missing.

Solution: before opening the file, disable automatic workbook calculation in the Excel options.

Loading the spreadsheet then shows the most recent values that are stored in the workbook.

27/10/2014

Google search by file extension

thermo filetype:tdb

Only gives results with extension tdb.

16/10/2014

CD imaging

09/10/2014

Publisher data formats

https://spotdocs.scholarsportal.info/display/EJournals/Publisher+Data+Formats

06/10/2014

EPUBCHECK validation errors/warnings

Both errors and warnings are reported in the same message element in the XML. E.g. compare:

  <status>Not well-formed</status>
  <messages>
   <message>ERROR: /OEBPS/cover.html(5): non-standard stylesheet resource 'OEBPS/page-template.xpgt' of type 'application/vnd.adobe-page-template+xml'. A fallback must be specified.</message>
   <message>ERROR: /OEBPS/copyright.html(5): non-standard stylesheet resource 'OEBPS/page-template.xpgt' of type 'application/vnd.adobe-page-template+xml'. A fallback must be specified.</message>
   </messages>

with this:

  <status>Well-formed</status>
  <messages>
   <message>WARN: /OEBPS/toc.ncx: meta@dtb:uid content 'null' should conform to unique-identifier in content.opf: '821'</message>
  </messages>

So output needs some parsing. Tested w. epubcheck 3.0.1.
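
A sketch of what that parsing could look like, assuming the severity is always the prefix before the first ": " as in the two examples above, and that the elements carry no XML namespace. The function name and the wrapping root element in the example are made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_messages(xml_text):
    """Group message texts by their ERROR / WARN prefix."""
    root = ET.fromstring(xml_text)
    result = {"ERROR": [], "WARN": [], "OTHER": []}
    for element in root.iter("message"):
        text = (element.text or "").strip()
        severity, separator, detail = text.partition(": ")
        if separator and severity in ("ERROR", "WARN"):
            result[severity].append(detail)
        else:
            result["OTHER"].append(text)  # anything without a known prefix
    return result


example = """<report>
  <status>Well-formed</status>
  <messages>
   <message>WARN: /OEBPS/toc.ncx: meta@dtb:uid content 'null' should conform to unique-identifier in content.opf: '821'</message>
  </messages>
</report>"""
print(split_messages(example))
```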

02/10/2014

External drives Windows PC

  • E drive: Hitachi (large drive)
  • H drive: Buffalo (small drive)

H is used as backup disk for E.

26/09/2014

Jpylyzer poster DPC / 4C

17/18 November; poster cancelled, 90 s talk + 1 slide instead.

10/09/2014

Jpylyzer users & links

BnF:

http://www.bnf.fr/documents/ref_num_fichier_image.pdf

04/09/2014

Ebook vs paper

Readers absorb less on Kindles than on paper, study finds:

http://www.theguardian.com/books/2014/aug/19/readers-absorb-less-kindles-paper-study-plot-ereader-digitisation

Reading and learning from screens versus print: a study in changing habits: Part 1 – reading long information rich texts:

http://www.emeraldinsight.com/doi/full/10.1108/NLW-01-2013-0012

http://www.scientificamerican.com/article/reading-paper-screens/

21/08/2014

Syncing a fork in Github

https://help.github.com/articles/syncing-a-fork

Requires:

https://help.github.com/articles/configuring-a-remote-for-a-fork

27/02/2014

Useful Python shit

10/02/2014

Create PDF from multiple TIFFS

GraphicsMagick command line:

gm convert -compress jpeg -quality 50 *.TIF test.pdf

Result: PDF with all images as JPEG, quality 50. According to Acrobat / Apache Preflight the PDF has some format conformance issues. One possible remedy is to re-process the PDF using Ghostscript. E.g. the command below produces a PDF that conforms to PDF/A-1b:

gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=test_a.pdf test.pdf

Source: http://stackoverflow.com/questions/1659147/how-to-use-ghostscript-to-convert-pdf-to-pdf-a-or-pdf-x

05/02/2014

2013: 74% of Dutch e-books distributed without DRM

http://www.cb-logistics.nl/wp-content/uploads/2013/01/74-percent-of-Dutch-e-books-distributed-without-DRM.pdf

04/02/2014

Unix Commands and Batch Processing for the Reluctant Librarian or Archivist

Link: http://journal.code4lib.org/articles/9158

03/02/2014

How to estimate JPEG Quality

Tutorial:

http://fotoforensics.com/tutorial-estq.php

But ... this is also possible with ImageMagick / GraphicsMagick (according to Approximate Quantization Table method that is mentioned in the tutorial):

http://superuser.com/questions/62730/how-to-find-the-jpg-quality
