This tutorial aims to provide technical information necessary for direct access to data stored in Zotero's local bibliographic database. The official documentation mentions the present topic briefly, so I decided to post a more detailed discussion.
Zotero's local bibliographic database is an SQLite database, the leading open-source relational database management system (RDBMS). Hence, direct access to this data necessitates basic familiarity with relational databases and SQL (free SQL resources are readily available on the Internet, including the official documentation focused on the SQLite SQL dialect and this tutorial).
Table of Contents
- Exploring Zotero Data Model for Direct Database Access
Zotero is an information manager designed for managing bibliographic information, which
- is open-source software,
- has a competitive feature set,
- has a long history of active development,
- employs the leading open-source data storage format.
The characteristics above are essential for an information manager that helps get the job done while not getting in the way, can be relied on in the long run, and doesn't lock the data.
Selected Zotero's item organization and retrieval feature highlights:
👍
- item organizations using tags and hierarchical collections (categories)
- full-text search, including search on tag names
- tag-based filters and complex metadata search queries via GUI
- items may belong to multiple collections
👎 missing collection features
- full-text search on collection names
- selecting/dragging/dropping multiple collections
- directory-like name uniqueness enforcement
Even though Zotero provides a wide variety of export data options, none of them, except, perhaps, for the native XML, preserves all metadata. Each export format has certain limitations, and the more truthful and complete formats will likely require more elaborate post-processing to suit particular needs. For this reason, it is a good idea to understand how Zotero models and stores the data and how to access/retrieve it directly, bypassing the Zotero software completely.
As an information manager, Zotero facilitates interaction between the user and the data storage backend. In this capacity, Zotero performs three core data functions - presentation (the GUI), modeling (transformation of data for optimal handling by the database), and transfer (storage/retrieval to/from the data storage backend). Bypassing Zotero software means using general-purpose database-specific tools (database administration software) for direct data retrieval from the database (and, possibly, its modification).
Because Zotero relies on the ubiquitous SQLite RDBMS for its data storage, well-developed database administration tools are readily available. Some examples of free and open-source software include, for example:
These tools still perform the data transfer function (mentioned above) but not modeling. Because the latter is absent, data presentation (which relies on appropriate modeling) is limited, only reflecting the storage format of raw data chunks supported by the database. While having full access to all data and metadata, the user becomes responsible for any necessary data interpretation, such as structure reconstruction.
Side notes
- While all these tools provide comparable functionality, the convenience of individual features varies between programs, so it might be a good idea to learn a few of them.
- Remembering one of the essential differences between SQLite and typical client/server databases (such as MySQL or PostgreSQL) may save a lot of trouble when using an SQLite manager and running SQL queries. Because a single small library file incorporates the entire SQLite engine, programs usually use their private (rather than the system) copy, which resides next to the GUI starting executable (a common practice on Windows). These private copies have different version numbers and slightly different feature sets (because of varying build options and continual development of new features). While these differences may have no immediate manifestation, if in doubt, always run basic SQLite introspection queries (for each client used!), checking the actual library version and available features.
- Sample SQL queries below require SQLite 3.46.0. For all mentioned clients, replace the "sqlite3.dll" file (on Windows) with the official current release or a custom build.
- The sample SQL queries below extensively use Common table expressions (CTE). Without CTEs, developing and understanding any practical query rapidly turns into intractable nightmares with size. Various tutorials on CTEs are available on the Internet, including one of my own.
Any data/information stored in a database only realizes its value when accessed, retrieved, and used. Hence, information management is about capturing/preserving/organizing (both individual values and the associated structure) information and ensuring that relevant information can be efficiently retrieved and used. An RDBMS is often a backend of choice, even when handling information outside the original scope of well-structured tabular data. The ubiquity and maturity of established RDBMSs usually outweigh the limitations associated with managing semi-structured information. Meanwhile, appropriate data modeling provides guidelines for chopping and shaping the original data for optimal storage and handling.
Typical mapping for structured datasets (composed of items sharing the same fixed set of data fields) involves storing individual dataset items (aka records) as table rows, with each field mapping to a dedicated table column. Establishing a similar mapping for semi-structured information, such as bibliographic records, is not as straightforward. Each bibliographic record/item describes a particular bibliographic source. Different source types (e.g., journal article, patent, report, etc.) may have shared (e.g., title, authors/creators, and URL) and source-specific (e.g., patent number and report type) fields. Further, some fields may possess a structure of their own. Notably, each bibliographic record may have one or multiple authors/creators, with different citation styles having varying rules regarding the use and formatting of this information.
One of the common approaches to storing semi-structured information in relational databases and implemented by Zotero is the "entity-attribute-value" (EAV) model. Instead of having the "table row per item" and "table column per field value" arrangement, an EAV data table follows a non-relational (or, rather, denormalized) "table row per field value" design, with all fields of an item "stacked" in the same table column of separate records ("vertical" design). An EAV table contains three columns:
- entity/item (e.g., a particular bibliographic record) identifier
- attribute/field (e.g., "title" or "URL") identifier
- value (e.g., "Zotero" or "https://zotero.org")
Because the EAV tables only store non-empty field values, this design may save considerable space for "sparse" data.
The ERD diagram below shows the "core" tables of the bibliographic Zotero database. Two-thirds of all database tables correspond to "secondary" features, apparently non-implemented features, or are otherwise irrelevant in the present context and not included in the diagram.
ERD diagram showing "core" Zotero tables
N.B. This diagram was produced by reverse engineering a Zotero database using ERD Concepts, removing less relevant tables, and manually rearranging the remaining tables for best fit.
Each bibliographic record stored in a Zotero database has a corresponding row in the "central" items table shown (and highlighted) in the center of the ERD diagram. The items table is a conventional table, providing item identifiers used by other tables (the itemID and key fields). The table also includes several general fields, particularly the associated itemTypeID referring to the record's Item Type (the first field in Zotero's "info" page shown for all records).
The three tables in the bottom-left quadrant of the ERD Diagram contain item type information. Two of them, itemTypes and fields, define programmatic codes for all record types ("typeName") and fields ("fieldName"). Each item type from the former table is a subset of fields from the latter. Hence, the many-to-many table itemTypeFields provides actual type definitions. This table also contains the "orderIndex" field, which indicates the order (top to bottom) in which Zotero's "info" pane shows the fields. The fourth table, baseFieldMappings, instructs Zotero which fields from different item types are considered equivalent for display and search purposes. Each of the four tables also contains a "companion" table with an additional suffix "Combined" in their names. These tables essentially contain the same information. There is also a second set of "companion" tables, having the prefix "custom" in their names but no data or references from the source code.
The associated textual localized labels for both type and field names presented to the user in Zotero GUI are located in the "chrome/locale" subfolders of the source tree (and the same path inside the "zotero.jar" file found in the program directory). Item types do not have other references in Zotero's source code. Therefore, it might be possible to define/add new types to Zotero by adding appropriate rows to the three type tables and the labeling information to locale files in the "chrome/locale" subfolders.
The following query returns the list of item identifiers and the associated type names (the query is intentionally wrapped in a CTE clause so that it could be directly copy-pasted and used in more complex queries):
WITH
itemsEx AS (
SELECT
items.itemID, items.itemTypeID,
itemTypes.typeName, items."key" AS itemKey
FROM items, itemTypes
WHERE items.itemTypeID = itemTypes.itemTypeID
)
SELECT * FROM itemsEx;
The following query returns the list of types and the associated field lists formatted as JSON arrays:
SELECT
itemTypeFields.itemTypeID, itemTypes.typeName,
json_group_array(fieldName ORDER BY orderIndex) AS fieldNames
FROM itemTypeFields, fields, itemTypes
WHERE itemTypeFields.fieldID = fields.fieldID
AND itemTypeFields.itemTypeID = itemTypes.itemTypeID
GROUP BY itemTypeFields.itemTypeID
ORDER BY itemTypeFields.itemTypeID;
The tables and queries above do not include author/creator fields. Each item type may contain certain creator types (per many-to-many table itemTypeCreatorTypes) from the list defined in the creatorTypes table (top-left quadrant of the ERD diagram).
The following query returns a list of item types and associated permissible creator types as JSON arrays:
SELECT
itemTypes.typeName,
json_group_array(creatorType
ORDER BY primaryField DESC,itemTypeCreatorTypes.creatorTypeID
) AS creatorTypeNames
FROM itemTypes, creatorTypes, itemTypeCreatorTypes
WHERE itemTypeCreatorTypes.creatorTypeID = creatorTypes.creatorTypeID
AND itemTypeCreatorTypes.itemTypeID = itemTypes.itemTypeID
GROUP BY itemTypes.typeName
ORDER BY itemTypes.itemTypeID;
A combination of the two queries yields a consolidated list of types with corresponding fields and creator names:
WITH
fieldNames AS (
SELECT
itemTypeFields.itemTypeID, itemTypes.typeName,
json_group_array(fieldName ORDER BY orderIndex) AS fieldNamesJSON
FROM itemTypeFields, fields, itemTypes
WHERE itemTypeFields.fieldID = fields.fieldID
AND itemTypeFields.itemTypeID = itemTypes.itemTypeID
GROUP BY itemTypeFields.itemTypeID
),
creatorTypeNames AS (
SELECT
itemTypes.typeName,
json_group_array(creatorType
ORDER BY primaryField DESC,itemTypeCreatorTypes.creatorTypeID
) AS creatorTypeNamesJSON
FROM itemTypes, creatorTypes, itemTypeCreatorTypes
WHERE itemTypeCreatorTypes.creatorTypeID = creatorTypes.creatorTypeID
AND itemTypeCreatorTypes.itemTypeID = itemTypes.itemTypeID
GROUP BY itemTypes.typeName
)
SELECT fieldNames.*, creatorTypeNamesJSON
FROM fieldNames, creatorTypeNames
WHERE fieldNames.typeName = creatorTypeNames.typeName
ORDER BY itemTypeID
The right half of the ERD Diagram's top row shows two EAV tables, itemData and itemDataValues. It is unclear why developers split the EAV table into two. Naturally, the third column of the itemData table would be itemDataValues.value, not valueID. The following query reconstructs items and returns them as JSON objects:
WITH
eav AS (
SELECT itemData.itemID, itemData.fieldID, itemDataValues.value
FROM itemData, itemDataValues
WHERE itemData.valueID = itemDataValues.valueID
),
eav_ex AS (
SELECT items.itemTypeID, eav.itemID, eav.fieldID, orderIndex, eav.value
FROM eav, items, itemTypeFields
WHERE (items.itemID, items.itemTypeID, eav.fieldID) =
(eav.itemID, itemTypeFields.itemTypeID, itemTypeFields.fieldID)
),
item_data AS (
SELECT
itemTypeID, itemID,
json_group_object(fields.fieldName, value) AS data
FROM eav_ex, fields
WHERE eav_ex.fieldID = fields.fieldID AND itemTypeID NOT IN (2, 26)
GROUP BY itemID
ORDER BY itemID
)
SELECT * FROM item_data;
Most relational databases require that each table column defines a specific type for all stored values. This requirement necessitates that the EAV value column stores any numeric value as text, and the attribute should include information used for original data reconstruction. SQLite, however, does not have this requirement and permits, by default, the storage of all data types in a single column.
Again, a dedicated table creators holds eponymous data, and the many-to-many table itemCreators defines creator names for each item (top-left quadrant in the ERD diagram) and their order (the orderIndex field). The table itemCreators contains the creatorTypeID field, which describes creators, not the item/creator relation, and should be in the creators table. The present design is more consistent with the creatorNames table name.
The following query returns creators for each item as JSON objects:
SELECT
items.itemID,
items."key",
json_object('creators',
json_group_array(
CASE creators.fieldMode
WHEN 0 THEN
json_object('creatorType', creatorTypes.creatorType,
'firstName', creators.firstName,
'lastName', creators.lastName)
ELSE
json_object('creatorType', creatorTypes.creatorType,
'fullName', creators.lastName)
END
ORDER BY orderIndex
)
) AS creators
FROM creators, creatorTypes, items, itemCreators
WHERE (itemCreators.creatorTypeID, itemCreators.creatorID, itemCreators.itemID) =
(creatorTypes.creatorTypeID, creators.creatorID, items.itemID)
GROUP BY itemCreators.itemID
ORDER BY itemCreators.itemID;
The itemTypes table includes two specialized types: attachment and note (which employ somewhat clunky, inconsistent, and questionable design decisions).
Each attachment object (I intentionally avoid calling it an item) has an associated record in the main items table, and the EAV tables discussed above may include three fields defined for this item type. At the same time, a dedicated table (itemAttachments, located below the items table in the ERD diagram) stores additional attachment metadata. The motivation for splitting attachment object fields between the EAV tables and the dedicated attachments table is unclear; IMHO, placing all fields in the attachments table would make more sense.
The parentItemID field is consistent with the idea that attachments are to be attached to regular (anything but attachment or note) items. Even though Zotero does not provide an option to create new "standalone" attachment items, a file may be dragged onto the main Zotero window, causing Zotero to make one (setting parentItemID to NULL). An existing attachment may also be dragged and dropped onto a different item, changing the parent item, or dropped onto the whitespace area in the central pane, which converts the status of an attachment object from attached to standalone. Attachments are not meant for use as standalone items, and such use is discouraged.
The other noteworthy field is linkMode, which defines the attachment subtype. The meaning of this field is not included in the database as a hypothetical linkModeTypes table but is defined in the "chrome/content/zotero/xpcom/attachments.js" source file as having five possible values:
- LINK_MODE_IMPORTED_FILE = 0;
- LINK_MODE_IMPORTED_URL = 1;
- LINK_MODE_LINKED_FILE = 2;
- LINK_MODE_LINKED_URL = 3;
- LINK_MODE_EMBEDDED_IMAGE = 4;
Contrary to what value names might suggest, files are never stored inside the database. The two linked types merely store information about their target location. The other three make a copy of the target file in the local storage directory and link to that copy. The last embedded image subtype does not have any associated values in the EAV tables.
The path field encodes the file system path of the attached file. Its value and interpretation depend on the attachment subtype:
- LINK_MODE_LINKED_URL
The path field is NULL because the attachment is merely an Internet address - LINK_MODE_LINKED_FILE
If a file is stored within the "Linked Attachment Base Directory" (as defined in "Preferences->Advanced->Files and Folders"), the field contains the prefix "attachments:" followed by the file's relative path. Otherwise, Zotero probably stores absolute paths, but this option should generally be avoided. - LINK_MODE_IMPORTED_FILE, LINK_MODE_IMPORTED_URL, LINK_MODE_EMBEDDED_IMAGE
The file is copied to a directory named after the key field from the items table record of the attachment item. This directory is, in turn, created inside the "storage" directory located inside the "Data Directory" (as defined in "Preferences->Advanced->Files and Folders"). The path includes the prefix "storage:" followed by the file name.
The following query returns consolidated attachment information as JSON objects for all items. The parent reference of standalone attachments is set to point to the item itself.
WITH
-- Based on the definition in "chrome\content\zotero\xpcom\attachments.js"
-- LINK_MODE_IMPORTED_FILE = 0;
-- LINK_MODE_IMPORTED_URL = 1;
-- LINK_MODE_LINKED_FILE = 2;
-- LINK_MODE_LINKED_URL = 3;
-- LINK_MODE_EMBEDDED_IMAGE = 4;
linkModeTypes(linkModeID, linkModeName) AS (
VALUES
(0, 'importedFile'),
(1, 'importedURL'),
(2, 'linkedFile'),
(3, 'linkedURL'),
(4, 'embeddedImage')
),
attachmentsEAV AS (
SELECT itemData.itemID, json_group_object(fieldName, value) AS fieldData
FROM itemData, itemDataValues, fields, items
WHERE (itemData.itemID, itemData.fieldID, itemData.valueID) =
(items.itemID, fields.fieldID, itemDataValues.valueID)
AND items.itemTypeID = 2
GROUP BY itemData.itemID
),
attachmentsMain AS (
SELECT
itemAttachments.itemID,
items."key" AS itemKey,
-- Take care of standalone attachment items (parentItemID IS NULL)
coalesce(parentItemID, itemAttachments.itemID) AS parentItemIDex,
parentItems."key" AS parentItemKey,
linkMode,
linkModeName,
iif(contentType IS NOT NULL,
'"contentType": "' || contentType || '",', '') AS contentType,
iif(path IS NOT NULL, '"path": "' || path || '",', '') AS path,
CASE linkMode
WHEN 2 THEN
'"pathSpec": "' ||
replace(path, 'attachments:', 'attachments:{BaseDir}/') || '",'
WHEN 3 THEN
''
ELSE
'"pathSpec": "' ||
replace(path, 'storage:', 'storage:{DataDir}/storage/' ||
items."key" || '/') || '",'
END AS pathSpec
FROM itemAttachments, linkModeTypes, items, items AS parentItems
WHERE itemAttachments.itemID = items.itemID
AND parentItemIDex = parentItems.itemID
AND linkMode = linkModeID
),
attachmentsPre AS (
SELECT
attachmentsMain.itemID, itemKey, parentItemIDex AS parentItemID,
parentItemKey, linkMode, linkModeName, contentType, path, pathSpec,
json_object('$.itemID', attachmentsMain.itemID,
'$.itemKey', itemKey,
'$.linkMode', linkMode,
'$.linkModeName', linkModeName
) AS dataA,
json('{' || contentType || path || pathSpec || '}') AS dataB,
coalesce(fieldData, '') AS fieldData
FROM attachmentsMain LEFT JOIN attachmentsEAV
ON attachmentsMain.itemID = attachmentsEAV.itemID
),
attachmentsCombined AS (
SELECT
itemID, itemKey, parentItemID, parentItemKey,
replace(replace(
dataA || dataB || fieldData, '{}', ''),'}{', ',') AS data
FROM attachmentsPre
ORDER BY itemID
),
itemattAchmentsCombined AS (
SELECT
parentItemID, parentItemKey,
json_object('Attachments',
json_group_array(
json(data) ORDER BY data ->> '$.title' COLLATE NOCASE
)
) AS Attachments
FROM attachmentsCombined
GROUP BY parentItemID
ORDER BY parentItemID
)
SELECT * FROM itemattAchmentsCombined;
Similar to attachments, notes may be attached or standalone. At the same time, Zotero provides direct support for standalone note creation and does not store any note information in the EAV tables. Instead, all note data is stored as HTML-formatted text in the dedicated itemNotes table, shown to the left from the items table in the ERD diagram. There is no dedicated note title field; instead, the title column is populated automatically by Zotero, which extracts the first line (or part of it) from the note text.
Zotero (or the Zotfile extension?) automatically extracts TOC from the attached PDF files, adding associated notes but not creating associated records in the items table and not filling the parentItemID field. Therefore, associating these automatic notes with appropriate items is not straightforward.
The following query returns all these TOC notes:
SELECT * FROM itemNotes
WHERE note LIKE '<div class="zotero-note znv1">' ||
'<p xmlns="http://www.w3.org/1999/xhtml" id="title"><strong>Contents</strong></p>' ||
'<ul xmlns="http://www.w3.org/1999/xhtml" ' ||
'style="list-style-type: none; padding-left:0px" id="toc">%'
The following query returns data for manually created notes:
WITH
entities AS (
SELECT itemID, "key" AS itemKey
FROM items
WHERE itemTypeID = 26
ORDER BY itemID
),
notes AS (
SELECT itemNotes.*, itemKey
FROM itemNotes
LEFT JOIN entities
ON itemNotes.itemID = entities.itemID
WHERE itemKey IS NOT NULL
)
SELECT * FROM notes;
Collections are directory-like (to a certain extent) hierarchical containers designed for reference organization. Collection information is contained in the collections and collectionItems tables, appearing in the bottom-right corner of the ERD diagram. Zotero's collections obey the following rules:
- Each collection has exactly one parent, except for the top-level collections, which do not have parents.
- Each item may be a member of zero or more collections.
- Items belonging to the same collection may have identical titles.
- Sibling collections may have identical names.
The collectionItems table is a conventional many-to-many table (additionally marking the order of items assigned to the same collection).
The collections table needs to store both collection names and hierarchy information. There are several approaches to handling trees/hierarchies in relational databases. Zotero implements the adjacency list model: each record contains a reference to its parent (parentCollectionID). Another helpful representation is materialized paths (aka path enumeration).
Converts collections (adjacency list) to materialized paths. Collection names and keys form two synonymous paths as JSON arrays of path elements, starting from the root element as the first array member.
WITH RECURSIVE
cols_unsorted AS (
SELECT
col.collectionID AS colID,
col.collectionName AS colName,
col.parentCollectionID AS parentColID,
col."key" AS colKey,
json_array(col.collectionName) AS pathName,
json_array(col."key") AS pathKey
FROM collections AS col
WHERE parentColID IS NULL
UNION ALL
SELECT
col.collectionID AS colID,
col.collectionName AS colName,
col.parentCollectionID AS parentColID,
col."key" AS colKey,
json_insert(cols_unsorted.pathName, '$[#]', col.collectionName
) AS pathName,
json_insert(cols_unsorted.pathKey, '$[#]', col."key") AS pathKey
FROM cols_unsorted, collections AS col
WHERE cols_unsorted.colID = col.parentCollectionID
),
cols_unpacked AS (
SELECT *
FROM cols_unsorted
ORDER BY substr(pathName, 1, length(pathName) - 1)
),
cols_packed AS (
SELECT
colID,
pathName,
json_object('colID', colID, 'colName', colName,
'colKey', colKey, 'pathName', json(pathName),
'pathKey', json(pathKey)) AS collection
FROM cols_unpacked
ORDER BY substr(pathName, 1, length(pathName) - 1)
)
SELECT * FROM cols_packed;
The following query returns the list of all "filed" items and associated collections as JSON objects.
WITH RECURSIVE
cols_unsorted AS (
SELECT
col.collectionID AS colID,
col.collectionName AS colName,
col.parentCollectionID AS parentColID,
col."key" AS colKey,
json_array(col.collectionName) AS pathName,
json_array(col."key") AS pathKey
FROM collections AS col
WHERE parentColID IS NULL
UNION ALL
SELECT
col.collectionID AS colID,
col.collectionName AS colName,
col.parentCollectionID AS parentColID,
col."key" AS colKey,
json_insert(cols_unsorted.pathName, '$[#]', col.collectionName
) AS pathName,
json_insert(cols_unsorted.pathKey, '$[#]', col."key") AS pathKey
FROM cols_unsorted, collections AS col
WHERE cols_unsorted.colID = col.parentCollectionID
),
cols_unpacked AS (
SELECT *
FROM cols_unsorted
ORDER BY substr(pathName, 1, length(pathName) - 1)
),
cols_packed AS (
SELECT
colID,
pathName,
json_object('colID', colID, 'colName', colName,
'colKey', colKey, 'pathName', json(pathName),
'pathKey', json(pathKey)) AS collection
FROM cols_unpacked
ORDER BY substr(pathName, 1, length(pathName) - 1)
),
itemCollections AS (
SELECT
itemID,
json_object('collections',
json_group_array(
json(collection)
ORDER BY substr(pathName, 1, length(pathName) - 1)
)
) AS collections
FROM collectionItems, cols_packed AS cols
WHERE collectionItems.collectionID = cols.colID
GROUP BY itemID
ORDER BY itemID
)
SELECT * FROM itemCollections;
Tags help organize items. Zotero implements a standard tag model (any regular item may be assigned any tags). Hence, the tags and many-to-many itemTags tables (shown to the right from the items table in the ERD diagram) store the tags and assignment information, respectively.
The following query lists items and all associated tags as JSON objects:
SELECT itemTags.itemID,
json_object('tags',
json_group_array(
json_object('id', itemTags.tagID, 'name', name, 'auto', type)
ORDER BY tags.name COLLATE NOCASE
)
) AS tags
FROM itemTags, tags
WHERE itemTags.tagID = tags.tagID
GROUP BY itemTags.itemID
ORDER BY itemTags.itemID;
In addition to conventional tags and hierarchical collections, Zotero provides a third tool for organizing items - relations, which defines related collections. The associated information is stored in the many-to-many itemRelations table (shown next to the bottom-right corner of the items table in the ERD diagram). While there are particular scenarios for which this feature might be a natural fit, the same information can be encoded using tags or collections. Further, because no metadata can be associated with relations, it is, IMHO, barely usable.
If Better BibTex for Zotero is installed, the following query attempts to attach its database and return citation keys.
DETACH better_bibtex;
ATTACH (
SELECT substr(file, 1, length(file) - 13) || 'better-bibtex.sqlite' AS prefix
FROM pragma_database_list()
WHERE name = 'main'
) AS better_bibtex;
SELECT itemID, itemKey, citationKey, pinned
FROM better_bibtex.citationkey
ORDER BY itemID;