Ever since I started working on my honours thesis in 2013, I have been tinkering with various workflows to manage references, PDFs, PDF annotations and notes in some coherent way. The workflow I describe here is the latest one (February 2021 as of writing), and I think I've finally found something that satisfies a lot of my [admittedly very subjective] desiderata.
Being [perhaps overly] wary of mis-citing something, I'd like to be able to quickly go back to the original source, and to the exact page and PDF highlight that I'm referring to. For the last couple of years, I have been using the Highlights app (macOS only, unfortunately; though there may be Windows/Linux equivalents). The two main features of Highlights are that:
- It automatically extracts PDF highlights and is able to keep them updated in a 'sidecar' file, so for a file like Ram_et_al_2020_Neural_Network.pdf, there'll be a Ram_et_al_2020_Neural_Network.md.
- It can organise the PDF annotations by colour categories, which you can re-label as desired (e.g. Red = Method) and sort all annotations by colour in your order of preference.
So going through and highlighting a PDF looks like this (note red highlights shoved into a 'Method' section with corresponding page number):
Given these colour category settings on the left, a Markdown-formatted text file shown on the right is produced (and maintained as you add/remove highlights in the PDF).
Up until recently (uh, February 2021), I had just used macOS's Spotlight to directly access the PDFs whenever I needed to [re-]find a specific annotation. I typically know the first author and year, so typing Ram 2020 is enough to bring up the PDF and open it in Highlights, after which I manually scroll to the specific page/annotation.
I recently discovered Obsidian, which describes itself as "a powerful knowledge base that works on top of a local folder of plain text Markdown files."
Basically, if you have a bunch of Markdown files in a folder (what Obsidian calls a 'Vault'), you can create, manage, and view/search/browse the content in these Markdown files using links (what Obsidian calls 'connections'). You define these connections using Obsidian's bracket notation:
- [[some_markdown_file]]: Link to the file called some_markdown_file.md
- [[some_markdown_file#Page 2]]: Link to the section called Page 2 in the file called some_markdown_file.md
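To make the bracket notation concrete, here is a minimal Python sketch that pulls the target file and optional section out of such links. The regular expression and the function name `parse_wikilinks` are my own illustration, not anything Obsidian ships, and the sketch only covers the two forms listed above (it does not handle other link variants Obsidian may support).

```python
import re

# Matches [[target]] and [[target#section]]. This pattern is an assumption
# made for illustration; it only covers the two link forms described above.
WIKILINK = re.compile(r"\[\[([^\]#|]+)(?:#([^\]|]+))?\]\]")


def parse_wikilinks(text):
    """Return a (markdown_file, section_or_None) pair for each link in text."""
    links = []
    for match in WIKILINK.finditer(text):
        target, section = match.group(1), match.group(2)
        links.append((target.strip() + ".md",
                      section.strip() if section else None))
    return links


note = "See [[Ram_et_al_2020_Neural_Network#Page 2]] and [[machine-learning]]."
print(parse_wikilinks(note))
# [('Ram_et_al_2020_Neural_Network.md', 'Page 2'), ('machine-learning.md', None)]
```

The point is simply that a link resolves to a file name plus an optional heading, which is why stable file names matter for everything that follows.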
The best thing about Obsidian is the autocomplete that pops up as soon as you type [[, so you can search for the relevant file:
Another bonus is that with Obsidian, I can split the editor/viewer into various configurations as desired:
This lets me view highlights from the original source in a different pane (the bottom one), while continuing to edit whatever primary document I'm currently working on. I must say not having to go look for the PDF, open the PDF in Highlights, scroll to the relevant page/annotation has made it so much easier to quickly find the part of a work I'm trying to cite (and not get accidentally distracted or forget what I was searching for along the way, which happens far too easily).
So, as mentioned, Obsidian relies on links between Markdown files, and Highlights creates Markdown files with the same name as the PDF file you're annotating. This brings us to managing PDFs and PDF names, because you'll want meaningful and stable PDF (and thus Markdown) file names for Obsidian and Highlights to work together.
Personally, I use Zotero with the ZotFile extension, which lets you define naming rules based on the PDF metadata. So any article I decide to actually read (uh, not those in my 'to read' folder) and highlight gets copied into a certain directory and renamed as appropriate. As you might have guessed, the scheme I have is first author, then year, then the first words of the title up to 15 characters (e.g. Ram_et_al_2020_Neural_Network.pdf). This is typically enough to offer a unique name for every reference in your collection.
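As a rough illustration of that naming scheme (this is not ZotFile's actual implementation or configuration syntax), the following Python sketch builds a file name from metadata. The function name `make_pdf_name`, the "et_al" rule for more than one author, the whole-word truncation, and the example title and co-author names are all assumptions made for the example.

```python
import re


def make_pdf_name(authors, year, title, max_title_len=15):
    """Build '<FirstAuthor>[_et_al]_<year>_<title words>.pdf'.

    Assumptions (mine, not ZotFile's documented behaviour): any paper with
    more than one author gets '_et_al', and title words are kept whole until
    adding the next word would exceed max_title_len characters.
    """
    author_part = authors[0] if len(authors) == 1 else authors[0] + "_et_al"
    words = re.findall(r"[A-Za-z0-9]+", title)
    kept = []
    for word in words:
        if len("_".join(kept + [word])) > max_title_len:
            break
        kept.append(word)
    if not kept and words:
        # The first word alone is too long: fall back to truncating it.
        kept = [words[0][:max_title_len]]
    return "%s_%s_%s.pdf" % (author_part, year, "_".join(kept))


# Hypothetical metadata, chosen only to reproduce the example name above.
print(make_pdf_name(["Ram", "Second", "Third"], 2020,
                    "Neural Network models for some task"))
# Ram_et_al_2020_Neural_Network.pdf
```

Whatever rule you pick, the important property is that it is deterministic, so the PDF name (and thus the Markdown name) never changes once links point at it.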
You could probably use Mendeley, BibDesk, Papers, or any other PDF/reference management software. The bonus for me with Zotero is the Better BibTeX extension, which helps me keep an up-to-date '.bib' version of my entire reference collection, so I can use it in any LaTeX documents I'm writing.
One slight issue with Highlights is that if you annotate images in PDFs (basically, screenshots of graphs/figures) it can't save into a plain .md file. So it saves the relevant screenshots into an assets folder and the text into a text.markdown file, all of which are contained in a sidecar folder:
|- /
|- Ram_et_al_2020_Neural_Network.pdf
|- Ram_et_al_2020_Neural_Network/
| |- text.markdown
| |- assets/
| | |- Page10Image32.66077_184.0669-261.85003_145.6573.png
So I wrote a Python script to convert Ram_et_al_2020_Neural_Network/text.markdown into Ram_et_al_2020_Neural_Network.md at export time, and also copy over everything in the assets folder into the Sources/assets folder which is part of my Obsidian Vault:
|- /
|- Daily notes/
|- Things I keep having to Google/
|- Topics/
| |- machine-learning.md
| |- speech-processing.md
|- Sources/
| |- Baevski_et_al_to_appear_wav2vec_2.md
| |- Bartelds_et_al_2020_Neural.md
| |- Ram_et_al_2020_Neural_Network.md
| |- assets/
| | |- Page10Image32.66077_184.0669-261.85003_145.6573.png
Here's the script:
Note: I also prepend a link to the PDF at the top of the Markdown file, so I can easily get back to the PDF from the Obsidian viewer (my default PDF viewer is Highlights, so the file:/// link opens the PDF in Highlights).
#!/usr/bin/env python
import os, glob, shutil, pathlib

obsedian_path = '/Users/fauxneticien/Google Drive/gnawledges/gnawledges/Sources/'

# For pure Markdown files without assets (.md sidecar file is same name as pdf)
for md_file in glob.glob('/Users/fauxneticien/Zotero/storage/**/*.md'):
    p = pathlib.PurePath(md_file).parents[0]
    pdf_path = glob.glob(str(p) + '/*.pdf')[0]
    pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
    pdf_link = "[%s](file:///%s)\n" % (pdf_name, pdf_path)
    md_content = pathlib.Path(md_file).read_text()
    with open(obsedian_path + pdf_name + ".md", "w") as output_md:
        output_md.write(pdf_link + md_content)

# For TextBundle folders with assets folder
for tb_path in glob.glob('/Users/fauxneticien/Zotero/storage/**/*.textbundle'):
    p = pathlib.PurePath(tb_path).parents[0]
    pdf_path = glob.glob(str(p) + '/*.pdf')[0]
    pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
    pdf_link = "[%s](file:///%s)\n" % (pdf_name, pdf_path)
    md_content = pathlib.Path(tb_path + '/text.markdown').read_text()
    with open(obsedian_path + pdf_name + ".md", "w") as output_md:
        output_md.write(pdf_link + md_content)
    # Copy over assets
    for asset in glob.glob(tb_path + '/assets/*'):
        shutil.copy(asset, obsedian_path + 'assets')
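One detail worth a second look is the PDF link: the script joins "file:///" onto an absolute path that already starts with "/", and it leaves spaces unescaped, which may or may not matter depending on where your PDFs live. A possible alternative, sketched here under the assumption of a macOS/POSIX absolute path, is to let pathlib build the URI. The helper name `pdf_link_line` and the example path are hypothetical.

```python
from pathlib import Path


def pdf_link_line(pdf_path):
    """Return a Markdown link line pointing at the PDF via a file:// URI.

    Path.as_uri() needs an absolute path and percent-encodes characters
    such as spaces. The function name is an illustration, not part of the
    original script.
    """
    pdf = Path(pdf_path)
    return "[%s](%s)\n" % (pdf.stem, pdf.as_uri())


# Hypothetical path containing a space, to show the percent-encoding.
example = "/Users/someone/My Papers/Ram_et_al_2020_Neural_Network.pdf"
print(pdf_link_line(example), end="")
# [Ram_et_al_2020_Neural_Network](file:///Users/someone/My%20Papers/Ram_et_al_2020_Neural_Network.pdf)
```

The returned string could stand in for the pdf_link value in either loop of the script.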
To be able to call the Python script on demand, I turned it into a 'quick action' for Highlights using macOS's Automator:
Which appears in Highlights as a custom service:
I also created a keyboard shortcut in System Preferences so I can quickly export whatever PDF I'm currently annotating into my Obsidian Vault.
In hindsight, I might also create a service for Obsidian and call it 'Import all Highlights annotations' or something like that. It's quite rare that I'm highlighting and writing prose at the same time. I usually take notes within Highlights itself (which, of course, also get exported in the Markdown).