@alanzchen
Last active April 30, 2024 23:14
Copying PDFs from DevonThink to their corresponding Zotero items
{
"cells": [
{
"cell_type": "markdown",
"id": "particular-dominican",
"metadata": {},
"source": [
"# Migrating Papers in DevonThink to Zotero\n",
"\n",
"So I have been using DevonThink to manage my literature and notes as it allows me to refer to a specific PDF (even a specific page number) using a URL. When I am taking notes in Roam Research and in any other apps, I can easily refer a specific page of a paper using this URL. I've been using Zetero purely as a citation manager for generating Bibtex and inserting ciations in Word and Google Docs, but I've never actively managed my readings there.\n",
"\n",
"Previously, my workflow looks like this:\n",
"\n",
"1. Collect full-text PDF and add the paper to Zotero with 1 click.\n",
"2. Move the PDF into my DevonThink.\n",
"3. Copy the URL of the PDF in DevonThink.\n",
"4. Open the PDF side-by-side and start writing notes in Logseq.\n",
"\n",
"A few months ago, I transited from Roam Research to Logseq, which has terrific support for in-app PDF annotations and native Zotero integration. It is a game changer for me -- I can simply collect a paper from web with a click of a button, then take notes on that paper within Logseq and anotate the PDF. The best part is I can create bidirectional link to every anotation I made on the PDF.\n",
"\n",
"In sum, this workflow will allow me to:\n",
"\n",
"1. Collect full-text PDF and add the paper to Zotero with 1 click.\n",
"2. Directly consume the PDF within Logseq.\n",
"\n",
"Logseq + Zotero greatly reduces the friction of my literature review process. But over the years I've collected a sizable amount of anotated PDFs in my DevonThink library. To enjoy Logseq + Zotero, I need the PDF to be living directly in Zotero. Obviously, moving all the papers manually from my DevonThink library can be a very tiring proecss, so I decided to automate the migration process.\n",
"\n",
"## The Solution\n",
"\n",
"1. Get the metadata of all papers living in Zotero. That's the superset of my paper in DevonThink.\n",
"2. For each of the paper in Zotero, find the corresponding PDF in DevonThink.\n",
"3. Copy the found PDF to Zotero.\n",
"\n",
"To achieve this, I need the help from three additional little tools.\n",
"\n",
"First, finding the coresponding PDF in DevonThink can be a headache. I tried using `pdfgrep` to search strings in PDFs in my DevonThink library, but it took too much time to search for a single string in my entire PDF collection, probably due to the lack of indexing. Lucikly, searching the DOI using DevonThink is lightning fast, which returns the result within a second. Thanks to the scriptablility of macOS and DevonThink, I can use `search.js` provided by an [Alfred plugin](https://github.com/mpco/AlfredWorkflow-DEVONthink-Search/blob/master/search.js) to do the job for me.\n",
"\n",
"Another headache is that sometimes DOI and the title of one paper may appear in the reference section of another paper. To rule out these false positive papers, I use `pdfgrep` to verify if the matching string appears within the first three pages of the PDF.\n",
"\n",
"Finally, I need to insert the PDF as an attachment to Zotero. To do so, I use a function provided by [ZotFile](http://zotfile.com) (`ZotFile.attachFile()`).\n",
"\n",
"## Getting Started\n",
"\n",
"You will need to install the aforementioned three tools:\n",
"\n",
"1. `brew install pdfgrep`\n",
"2. install [ZotFile](http://zotfile.com).\n",
"3. `curl https://raw.githubusercontent.com/mpco/AlfredWorkflow-DEVONthink-Search/master/search.js -o search.js`\n",
"\n",
"Then follow alone with the following steps.\n",
"\n",
"## Step 0: Remove all Attachments in Zotero (optional)\n",
"\n",
"If you already have some PDF stored in Zotero, I recommend removing them to avoid duplicates. You can bulk delete them using a smart saved search in Zotero.\n",
"\n",
"## Step 1: Get Metadata from Zotero\n",
"\n",
"For the safety of your data, quit Zotero first. Then execute the following cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "attractive-generic",
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"import json\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "polished-figure",
"metadata": {},
"outputs": [],
"source": [
"# change these paths\n",
"db = \"/Users/alan/Zotero/zotero.sqlite\"\n",
"storage_base = \"/Users/alan/Zotero/storage/\"\n",
"con = sqlite3.connect(db)\n",
"cur = con.cursor()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "raising-identification",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# get all items with DOIs\n",
"ITEM_ID_AND_DOI = \"\"\"\n",
"SELECT key, value FROM items JOIN itemData, itemDataValues WHERE items.itemID == itemData.itemID AND itemData.valueID == itemDataValues.valueID AND itemData.fieldID == \"26\"\n",
"\"\"\"\n",
"items_doi = dict(cur.execute(ITEM_ID_AND_DOI))\n",
"for k, v in items_doi.items():\n",
" items_doi[k] = v.strip(\"https://doi.org/\")\n",
"\n",
"# get all items titles\n",
"ITEM_ID_AND_TITLE = \"\"\"\n",
"SELECT key, value FROM items JOIN itemData, itemDataValues WHERE items.itemID == itemData.itemID AND itemData.valueID == itemDataValues.valueID AND itemData.fieldID == \"110\"\n",
"\"\"\"\n",
"items_title = dict(cur.execute(ITEM_ID_AND_TITLE))\n",
"items_title = {k: items_title[k] for k in items_title if k in items_doi.keys()}\n",
"\n",
"# GET existing PDF attachments\n",
"ITEM_ID_AND_PATH = \"\"\"\n",
"SELECT key, path FROM items JOIN itemAttachments WHERE items.itemID == itemAttachments.itemID AND contentType == \"application/pdf\"\n",
"\"\"\"\n",
"items_pdf_path = dict(cur.execute(ITEM_ID_AND_PATH))\n",
"not_found_cnt = 0\n",
"for k, v in items_pdf_path.items():\n",
" p = storage_base + k + \"/\" + v.replace(\"storage:\", \"\")\n",
" if not os.path.exists(p):\n",
" # print(p, \"not found\")\n",
" not_found_cnt += 1\n",
" items_pdf_path[k] = p\n",
"\n",
"print(not_found_cnt, \"pdfs not found.\")\n",
"\n",
"KEY_TO_ID = \"\"\"\n",
"SELECT key, itemID FROM items;\n",
"\"\"\"\n",
"\n",
"key2id = dict(cur.execute(KEY_TO_ID))"
]
},
{
"cell_type": "markdown",
"id": "western-browse",
"metadata": {},
"source": [
"## Step 2: Find Corresponding PDFs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ahead-transparency",
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"\n",
"\n",
"def find_pdf(s):\n",
" out = subprocess.run([\"/usr/bin/osascript\", \"-l\", \"JavaScript\", \"search.js\", s],\n",
" capture_output=True)\n",
" out.check_returncode()\n",
" if out.stderr:\n",
" items = json.loads(out.stderr)['items']\n",
" else:\n",
" return []\n",
" final_items = []\n",
" # now double check if the string appears in the first three pages of the PDF\n",
" # this is to make sure that we are not mathcing anything in the references\n",
" for i in items:\n",
" f = i['quicklookurl']\n",
" out = subprocess.run([\"pdfgrep\", s, f, \"--page-range=1-3\", \"-c\", \"-i\"],\n",
" capture_output=True)\n",
" try:\n",
" if int(out.stdout) > 0 & os.path.exists(f):\n",
" final_items.append(i)\n",
" except:\n",
" pass\n",
" return final_items"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dimensional-senator",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"item_doi_match = {} \n",
"item_doi_mismatch = {}\n",
"for k, v in items_doi.items():\n",
" doi = v\n",
" r = find_pdf(doi)\n",
" if len(r) > 1:\n",
" # print(doi, \"has more than 1 result.\")\n",
" paths = [i['quicklookurl'] for i in r]\n",
" item_doi_mismatch[k] = paths\n",
" elif len(r) == 1:\n",
" path = r[0]['quicklookurl']\n",
" # print(doi, \"found at\", path)\n",
" item_doi_match[k] = path"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "royal-title",
"metadata": {},
"outputs": [],
"source": [
"# Not all papers comes with a DOI, so we can also search by the title\n",
"\n",
"item_title_match = {} \n",
"item_title_mismatch = {}\n",
"for k, v in items_title.items():\n",
" title = v\n",
" r = find_pdf(title)\n",
" if len(r) > 1:\n",
" # print(doi, \"has more than 1 result.\")\n",
" paths = [i['quicklookurl'] for i in r]\n",
" item_title_mismatch[k] = paths\n",
" elif len(r) == 1:\n",
" path = r[0]['quicklookurl']\n",
" # print(doi, \"found at\", path)\n",
" item_title_match[k] = path"
]
},
{
"cell_type": "markdown",
"id": "laughing-anderson",
"metadata": {},
"source": [
"And here are the results of matching PDFs found in the DevonThink library. Now we export the results as a JSON file that Zotero will read later."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cardiovascular-jaguar",
"metadata": {},
"outputs": [],
"source": [
"zotero_import = {}\n",
"for k, v in item_doi_match.items():\n",
" zotero_import[key2id[k]] = v\n",
"for k, v in item_title_match.items():\n",
" zotero_import[key2id[k]] = v"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "proud-liberal",
"metadata": {},
"outputs": [],
"source": [
"with open(\"zotero_import.json\", 'w') as f:\n",
" json.dump(zotero_import, f)"
]
},
{
"cell_type": "markdown",
"id": "former-fancy",
"metadata": {},
"source": [
"## Step 3: Import PDFs in Zotero\n",
"\n",
"Now run Zotero and choose Tools - Developer - Run Javascript. Make sure ZotFile is installed.\n",
"\n",
"Copy and paste the following Javascript. Change the path as you see fit.\n",
"\n",
"```js\n",
"var path = '/Users/alan/zotero_import.json';\n",
"var data_ = await Zotero.File.getContentsAsync(path);\n",
"var data = JSON.parse(data_)\n",
"var error = new Map()\n",
"for (const [k, v] of Object.entries(data)) {\n",
" var i = Zotero.Items.get(k)\n",
" try {\n",
" await Zotero.ZotFile.attachFile(i, v)\n",
" } catch {\n",
" error.set(k, v)\n",
" }\n",
"}\n",
"return error\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "molecular-sculpture",
"metadata": {},
"source": [
"## Step 4: Manually Handle Edge Cases\n",
"\n",
"First, you would like to addresss the errors from the last step and see what's wrong.\n",
"\n",
"Second, `item_title_mismatch` and `item_doi_mismatch` store the papers with more than 1 match in your DevonThink library. You should manually add them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "boring-framing",
"metadata": {},
"outputs": [],
"source": [
"item_title_mismatch # these papers will require some manual handling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "reserved-attitude",
"metadata": {},
"outputs": [],
"source": [
"item_doi_mismatch"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}