@fauxneticien
Last active June 23, 2023 06:22
Research knowledge base with Zotero, Highlights, and Obsidian

Ever since I started working on my honours thesis in 2013, I have been tinkering with various workflows to manage references, PDFs, PDF annotations, and notes in some coherent way. The workflow I describe here is the latest one (February 2021 as of writing), and I think I've finally found something that satisfies a lot of my [admittedly very subjective] desiderata.

Knowledge provenance

Being [perhaps overly] wary of mis-citing something, I'd like to be able to quickly go back to the original source, and to the exact page and PDF highlight that I'm referring to. For the last couple of years, I have been using the Highlights app (macOS only, unfortunately, though there may be Windows/Linux equivalents). The two main features of Highlights are:

  1. It automatically extracts PDF highlights and is able to keep them updated in a 'sidecar' file, so for a file like Ram_et_al_2020_Neural_Network.pdf, there'll be a Ram_et_al_2020_Neural_Network.md.
  2. It can organise the PDF annotations by colour categories, which you can re-label as desired (e.g. Red = Method) and sort all annotations by colour in your order of preference.
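The sidecar naming convention from point 1 can be sketched in Python as follows. This is only an illustration of the naming rule, not something Highlights itself exposes; the function names `sidecar_for` and `annotated_pdfs` are made up for this example.

```python
from pathlib import Path


def sidecar_for(pdf_path):
    """Return the path of the Markdown sidecar expected next to a PDF."""
    # Same folder, same base name, with .pdf swapped for .md
    return Path(pdf_path).with_suffix(".md")


def annotated_pdfs(folder):
    """List the PDFs in `folder` that already have a sidecar file next to them."""
    return sorted(p for p in Path(folder).glob("*.pdf") if sidecar_for(p).exists())


print(sidecar_for("Ram_et_al_2020_Neural_Network.pdf"))
```

Something like `annotated_pdfs` could be handy for checking which PDFs in a folder have actually been highlighted so far.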

So going through and highlighting a PDF looks like this (note red highlights shoved into a 'Method' section with corresponding page number):

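[Screenshot, 2021-02-04: a PDF being highlighted in Highlights, with the colour category settings on the left and the generated Markdown file on the right]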

Given these colour category settings on the left, a Markdown-formatted text file shown on the right is produced (and maintained as you add/remove highlights in the PDF).

Knowledge management with Obsidian

Up until recently (uh, February 2021), I had just used macOS's Spotlight to directly access the PDFs whenever I needed to [re-]find a specific annotation. I typically know the first author and year, so typing Ram 2020 is enough to bring up the PDF. I would then open it in Highlights and manually scroll to the specific page/annotation.

I recently discovered Obsidian, which describes itself as "a powerful knowledge base that works on top of a local folder of plain text Markdown files."

Basically, if you have a bunch of Markdown files in a folder (what Obsidian calls a 'Vault'), you can create, manage, and view/search/browse the content in these Markdown files using links (what Obsidian calls 'connections'). You define these connections using Obsidian's bracket notation:

  • [[some_markdown_file]]: Link to the file called some_markdown_file.md
  • [[some_markdown_file#Page 2]]: Link to the section called Page 2 in the file called some_markdown_file.md
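To make the bracket notation concrete, here is a minimal Python sketch that pulls such links out of a note and splits each one into a file name and an optional section. It only handles the two forms listed above, it is not how Obsidian itself parses links, and the names `WIKILINK` and `parse_links` are made up for this example.

```python
import re

# Matches [[target]] and [[target#Section]] (only the two forms described above)
WIKILINK = re.compile(r"\[\[([^\]#|]+)(?:#([^\]|]+))?\]\]")


def parse_links(markdown_text):
    """Return (file_name, section) pairs for every [[...]] link in the text."""
    links = []
    for match in WIKILINK.finditer(markdown_text):
        target, section = match.groups()
        # A link to some_markdown_file points at some_markdown_file.md
        file_name = target.strip() + ".md"
        links.append((file_name, section.strip() if section else None))
    return links


note = "Method is on [[Ram_et_al_2020_Neural_Network#Page 2]]; see also [[machine-learning]]."
for file_name, section in parse_links(note):
    print(file_name, section)
```

A list like this could be used, for instance, to check that every linked file actually exists in the Vault folder.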

The best thing about Obsidian is the autocomplete that pops up as soon as you type [[, so you can search for the relevant file.

Another bonus is that with Obsidian, I can split the editor/viewer into various configurations as desired.

This lets me view highlights from the original source in a different pane (the bottom one), while continuing to edit whatever primary document I'm currently working on. I must say that not having to go look for the PDF, open it in Highlights, and scroll to the relevant page/annotation has made it so much easier to quickly find the part of a work I'm trying to cite (and not get accidentally distracted or forget what I was searching for along the way, which happens far too easily).

Reference and PDF management with Zotero

As mentioned, Obsidian relies on links between Markdown files, and Highlights creates Markdown files that have the same name as the PDF file you're annotating. This brings us to managing PDFs and PDF names, because you'll want meaningful and stable PDF (and thus Markdown) file names for Obsidian and Highlights to work together.

Personally, I use Zotero with the ZotFile extension, which lets you define naming rules based on the PDF metadata. So any article I decide to actually read (uh, not those in my 'to read' folder) and highlight gets copied into a certain directory and renamed as appropriate. As you might have guessed, the scheme I have is first author, then year, then the first words of the title up to 15 characters (e.g. Ram_et_al_2020_Neural_Network.pdf). This is typically enough to give every reference in your collection a unique name.
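As a rough illustration of that naming scheme, here is a small Python sketch. ZotFile does the renaming itself through its own rules, so this is not ZotFile's code. The function name `make_pdf_stem`, the "et al" rule for more than one author, and the choice to cut the title at whole words are all assumptions made for this example, and the author list and title in the usage line are illustrative inputs.

```python
import re


def make_pdf_stem(surnames, year, title, max_title_chars=15):
    """Build a name like Ram_et_al_2020_Neural_Network from reference metadata."""
    # Assumption: first author's surname, plus "et al" when there are co-authors
    author = surnames[0] + ("_et_al" if len(surnames) > 1 else "")

    # Keep whole title words for as long as they fit within max_title_chars
    kept = []
    for word in re.findall(r"[A-Za-z0-9]+", title):
        if len(" ".join(kept + [word])) > max_title_chars:
            break
        kept.append(word)

    return "_".join([author, str(year)] + kept)


# Illustrative metadata, not taken from a real Zotero library
print(make_pdf_stem(["Ram", "Second", "Third"], 2020, "Neural Network Based Query by Example"))
```

Cutting at whole words keeps the names readable, at the cost of sometimes using fewer than the 15 characters allowed.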

You could probably use Mendeley, BibDesk, Papers, or any other PDF/reference management software. The bonus for me with Zotero is the Better BibTeX extension, which helps me keep an up-to-date '.bib' version of my entire reference collection, so I can use it in any LaTeX documents I'm writing.

Bonus automation

One slight issue with Highlights is that if you annotate images in PDFs (basically, screenshots of graphs/figures), it can't save everything into a single plain .md file. Instead, it saves the relevant screenshots into an assets folder and the text into a text.markdown file, all of which are contained in a sidecar folder (a TextBundle, with the .textbundle extension that the script below looks for):

|- /
|- Ram_et_al_2020_Neural_Network.pdf
|- Ram_et_al_2020_Neural_Network.textbundle/
|  |- text.markdown
|  |- assets/
|  |  |- Page10Image32.66077_184.0669-261.85003_145.6573.png

So I wrote a Python script to convert Ram_et_al_2020_Neural_Network.textbundle/text.markdown into Ram_et_al_2020_Neural_Network.md at export time, and also to copy everything in the assets folder into the Sources/assets folder, which is part of my Obsidian Vault:

|- /
|- Daily notes/
|- Things I keep having to Google/
|- Topics/
|  |- machine-learning.md
|  |- speech-processing.md
|- Sources/
|  |- Baevski_et_al_to_appear_wav2vec_2.md
|  |- Bartelds_et_al_2020_Neural.md
|  |- Ram_et_al_2020_Neural_Network.md
|  |- assets/
|  |  |- Page10Image32.66077_184.0669-261.85003_145.6573.png

Here's the script:

Note: I also prepend a link to the PDF at the top of the Markdown file, so I can easily get back to the PDF from the Obsidian viewer (my default PDF viewer is Highlights, so the file:/// link opens the PDF in Highlights).

#!/usr/bin/env python3

import glob, shutil
from pathlib import Path

# Destination folder inside the Obsidian Vault
obsidian_path = Path('/Users/fauxneticien/Google Drive/gnawledges/gnawledges/Sources/')
# Zotero keeps each attachment in its own sub-folder of storage/
zotero_storage = '/Users/fauxneticien/Zotero/storage'

def export_notes(source_dir, notes_file):
    """Write one set of Highlights notes into the Vault, with a PDF link on top."""
    pdfs = sorted(source_dir.glob('*.pdf'))
    if not pdfs:
        return  # no PDF next to these notes, so there is nothing to link back to
    pdf_path = pdfs[0]
    pdf_name = pdf_path.stem

    # as_uri() gives a well-formed file:/// URL (spaces etc. are percent-encoded)
    pdf_link = "[%s](%s)\n" % (pdf_name, pdf_path.as_uri())
    md_content = notes_file.read_text(encoding='utf-8')

    output_md = obsidian_path / (pdf_name + ".md")
    output_md.write_text(pdf_link + md_content, encoding='utf-8')

# For pure Markdown files without assets (.md sidecar file is same name as pdf)
# Note: without recursive=True, '**' matches exactly one folder level
for md_file in glob.glob(zotero_storage + '/**/*.md'):
    md_file = Path(md_file)
    export_notes(md_file.parent, md_file)

# For TextBundle folders with assets folder
for tb_path in glob.glob(zotero_storage + '/**/*.textbundle'):
    tb_path = Path(tb_path)
    export_notes(tb_path.parent, tb_path / 'text.markdown')

    # Copy over assets (create Sources/assets first if it doesn't exist yet)
    assets_dir = obsidian_path / 'assets'
    assets_dir.mkdir(exist_ok=True)
    for asset in tb_path.glob('assets/*'):
        shutil.copy(asset, assets_dir)

To be able to call the Python script on demand, I turned it into a 'quick action' for Highlights using macOS's Automator, which appears in Highlights as a custom service.

I also created a keyboard shortcut in System Preferences so I can quickly export whatever PDF I'm currently annotating into my Obsidian Vault.

In hindsight, I might actually also create a service for Obsidian and call it 'Import all Highlights annotations' or something like that. It's quite rare that I'm highlighting and writing prose at the same time. I usually take notes within Highlights itself (which, of course, also get exported in the Markdown).
