paperless-ngx Post-Consumption Data Extraction from Filename Script

This is a paperless-ngx post-consumption script to extract document data based on the original filename.

It was initially developed by @gulpman and posted in this discussion thread. I've made further changes, detailed below.

Parsing

The script assumes the following pattern: {Date} - {Correspondent} - {Title}, and can parse the following cases:

  • 2023-01-23 - My Correspondent - Document_Title
  • 20230123 - My Correspondent - Document_Title
  • 20230123-My Correspondent - Document_Title

All of these will return the same data:

  • Date: 2023-01-23
  • Correspondent: My Correspondent
  • Title: Document Title

If a document doesn't have a full creation date (only a year and month), the script will assume the first of the month, e.g.

  • 2023-01-xx - Correspondent - Title
  • 202301xx - Correspondent - Title
  • 202301xx-Correspondent - Title

all become 2023-01-01.
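
For illustration, here's a minimal standalone sketch of that normalization, using the same regexp as the script below:

```python
import re

# Same pattern as the script below: groups 1, 4, and 6 hold date, correspondent, title
PATTERN = re.compile(r'(\d{4}-?\d{2}-?(\d{2}|xx))(\s?-\s?)(.*?)(\s-\s)(.*)')

def normalize(filename):
    m = PATTERN.match(filename)
    if not m:
        return None
    date = m.group(1).replace("xx", "01")    # unknown day -> first of month
    if "-" not in date:                      # 20230123 -> 2023-01-23
        date = f"{date[0:4]}-{date[4:6]}-{date[6:8]}"
    return date, m.group(4), m.group(6).replace("_", " ")

print(normalize("20230123-My Correspondent - Document_Title"))
# -> ('2023-01-23', 'My Correspondent', 'Document Title')
```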


Other Features & Behaviour

  • The script can add a tag to all processed documents to denote that they've been altered in some way
  • If a correspondent doesn't exist, it will be created

Todo

  • Add support for the following cases:
    • Date and Title (without Correspondent), e.g.
      • 2023-01-09 - Document Title
    • Custom fields to cover "start" and "end" date spans, e.g.
      • 2023-01-09 - My Correspondent - Invoice No. 123456 - 2022-12-01 to 2022-12-31
      • 2023-01-09 - My Correspondent - Phone Bill 2022-12-01 to 2022-12-31
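
The date-span case isn't implemented yet; one possible sketch of the title-splitting half (the trailing `YYYY-MM-DD to YYYY-MM-DD` format and the split behaviour are assumptions, not current script behaviour, and the matching custom fields would still need to exist in paperless-ngx):

```python
import re

# Hypothetical: detect a trailing "YYYY-MM-DD to YYYY-MM-DD" span in a title,
# returning the cleaned title plus the start and end dates (or None, None).
SPAN = re.compile(r'(.*?)\s*-?\s*(\d{4}-\d{2}-\d{2}) to (\d{4}-\d{2}-\d{2})$')

def split_span(title):
    m = SPAN.match(title)
    if not m:
        return title, None, None
    return m.group(1), m.group(2), m.group(3)

print(split_span("Phone Bill 2022-12-01 to 2022-12-31"))
# -> ('Phone Bill', '2022-12-01', '2022-12-31')
```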

Changes

All notable changes to this project will be documented in this file.

These are solely the changes I've made to the script, based on the original by @gulpman.

Added

  • Document tagging
  • Optional config .env parsing

Changed

  • Hoisted config to top of file
  • Generalized API get/creation functions
  • Replaced requests with httpx so as to remove external dependency
  • Replaced user/pass auth with API auth token (found under 'Edit Profile' in paperless-ngx UI)

Fixed

  • Some incorrectly-passed arguments
  • Some double-spacing & spelling issues
# If you want to use this .env file,
# Uncomment the relevant lines in post-consumption.py
# Credentials
API_AUTH_TOKEN=authtoken
# Connection info
PAPERLESS_URL=http://localhost:8000 ## Use the internal URL!
SESSION_TIMEOUT=5.0
# Tag will be added to the document if both:
# - ADD_TAG is set to True
# - The script was able to detect all of date, correspondent, and title
#
# This is helpful for debugging
ADD_TAG=True
MODIFIED_TAG=PostModified
#!/usr/bin/env python3
import httpx
import os
import re
import sys # needed to be able to "nicely" exit script :)
# Load config from .env file instead of hardcoding config values
from dotenv import load_dotenv
#######################################################
# Config
#
# Ensure your .env file follows the provided example
# Or, if you'd rather hardcode values, remove the following lines and uncomment the lower section.
#######################################################
load_dotenv()
# Credentials
API_AUTH_TOKEN = os.getenv("API_AUTH_TOKEN")
# Connection info
PAPERLESS_URL = os.getenv("PAPERLESS_URL", "http://localhost:8000")
SESSION_TIMEOUT = float(os.getenv("SESSION_TIMEOUT", 5.0))
# Tagging
# Note: bool() on any non-empty string is True, so parse the value explicitly
ADD_TAG = os.getenv("ADD_TAG", "True").lower() in ("true", "1", "yes")
MODIFIED_TAG = os.getenv("MODIFIED_TAG", "PostModified")
# If not using an .env file, remove the previous lines and uncomment the below
# API_AUTH_TOKEN="authtoken"
# PAPERLESS_URL = "http://localhost:8000" # use the internal url!
# SESSION_TIMEOUT = 5.0
# ADD_TAG = True
# MODIFIED_TAG = "PostModified"
#######################################################
# Filename parsing
#######################################################
def parseFileName(filename: str):
    """
    Parses the given string with the regexp defined inside the function to extract
    a date string, a correspondent string, and a title string.
    If any of the three values cannot be extracted from the input string, the
    script prints an error and exits.
    In my case I have all files already saved in the following format:
    2023-01-23 - My Phone Company - Invoice for December 2022
    I want to extract my pattern, like this:
    - 2023-01-23 -> date
    - My Phone Company -> correspondent
    - Invoice for December 2022 -> title
    I extended the regexp to handle the following filename formats as well:
    2023-01-xx - My Phone Company - Invoice for December 2022
    20230123 - My Phone Company - Invoice for December 2022
    20230123-My Phone Company - Invoice for December 2022
    202301xx - My Phone Company - Invoice for December 2022
    202301xx-My Phone Company - Invoice for December 2022
    """
    # Initialize the parts we want to find with the regexp - if they are not
    # found, they stay None, resulting in error messages later on
    date_extracted = None
    correspondent_extracted = None
    title_extracted = None

    # V3 of my matching regexp
    pattern = re.compile(r'(\d{4}-?\d{2}-?(\d{2}|xx))(\s?-\s?)(.*?)(\s-\s)(.*)')
    # 1: (\d{4}-?\d{2}-?(\d{2}|xx)) is the date in ISO order, either with or without
    #    hyphens (e.g. 2023-01-22 or 20230122). The day of month may also be written
    #    as xx (e.g. 2023-01-xx) if the document only gives a date like "January 2023"
    # 2: (\d{2}|xx) is either the day of month or xx
    # 3: (\s?-\s?) is the field divider, normally " - ", but sometimes my wife saves
    #    stuff without the spaces, so "-" is a positive match as well
    # 4: (.*?) is the correspondent, e.g. "My Phone Company". It is matched
    #    non-greedily, as otherwise we get in trouble with the "-"
    # 5: (\s-\s) is the divider between correspondent and title. Here the spaces are
    #    required, otherwise a title like "A-Team" would lead to big trouble :)
    # 6: (.*) is the rest of the filename - by definition the title,
    #    e.g. "Bill for December 2022"
    # So we need to take care of matching groups 1, 4, and 6
    findings = pattern.match(filename)
    if findings:
        date_extracted = findings.group(1)
        correspondent_extracted = findings.group(4)
        title_extracted = findings.group(6)

    # Post-process the extracted date, but only if the regexp was successful
    if date_extracted is not None:
        # If the date field contains "xx", replace it with "01" (first day of the month)
        date_extracted = date_extracted.replace("xx", "01")
        # If the date did not contain hyphens, add them back to get ISO format
        if date_extracted.find("-") < 0:
            year = date_extracted[0:4]
            month = date_extracted[4:6]
            day = date_extracted[6:8]
            date_extracted = f"{year}-{month}-{day}"

    print("Result of RegExp:")
    if date_extracted is None:
        print("No Date found! Exiting")
        sys.exit()
    if correspondent_extracted is None:
        print("No Correspondent found! Exiting")
        sys.exit()
    if title_extracted is None:
        print("No Title found! Exiting")
        sys.exit()
    print(f"Date extracted         : '{date_extracted}'")
    print(f"Correspondent extracted: '{correspondent_extracted}'")
    print(f"Doc Title extracted    : '{title_extracted}'")
    return date_extracted, correspondent_extracted, title_extracted

#######################################################
# Database querying
#######################################################
def getItemIDByName(item_name: str, route: str, session: httpx.Client, timeout: float):
    """
    Gets an item's ID by looking up its name at the given API route.
    If no item exists, returns None.
    This function handles a (potentially impossible?) edge case of multiple items
    existing under that name; input welcome!
    """
    # Query the API for data matching the name at the given route
    response_data = _get_resp_data(f"{route}?name__iexact={item_name}", session, timeout)
    response_count = response_data["count"]

    # If no item exists, return None
    if response_count == 0:
        print(f"No existing id found for item '{item_name}'.")
        return None
    # If one item exists, return that
    elif response_count == 1:
        new_item_id = response_data["results"][0]['id']
        print(f"Found existing id '{new_item_id}' for item: '{item_name}'")
        return new_item_id
    # If multiple items exist, print a warning and return the first
    else:
        print(f"Warning: Unexpected situation – multiple results found for '{item_name}'. Feedback welcome.")
        new_item_id = response_data["results"][0]['id']
        return new_item_id

def createItemByName(item_name: str, route: str, session: httpx.Client, timeout: float, skip_existing_check: bool = False):
    """
    Creates a new item in the database given its name and API route.
    An optional parameter allows skipping the check for whether the item already exists.
    """
    new_item_id = None

    # Conditionally check whether the item already exists
    if not skip_existing_check:
        new_item_id = getItemIDByName(item_name, route, session, timeout)
        if new_item_id is not None:
            return new_item_id

    # Create the item at the given route
    data = {
        "name": item_name,
        "matching_algorithm": 6,
        "is_insensitive": True
    }
    response = session.post(route, data=data, timeout=timeout)
    response.raise_for_status()
    new_item_id = response.json()["id"]
    print(f"Item '{item_name}' created with id: '{new_item_id}'")

    # If no new_item_id has been returned, something went wrong - do not process further
    if new_item_id is None:
        print(f"Error: Couldn't create item with name '{item_name}'! Exiting.")
        sys.exit()
    return new_item_id

def getOrCreateItemIDByName(item_name: str, route: str, session: httpx.Client, timeout: float):
    # Check for an existing item ID
    existing_id = getItemIDByName(item_name, route, session, timeout)

    # If no existing ID was found, create the item
    if existing_id is None:
        print(f"No item found with name: '{item_name}'; creating...")
        existing_id = createItemByName(item_name, route, session, timeout, skip_existing_check=True)
    return existing_id

def _get_resp_data(route: str, session: httpx.Client, timeout: float):
    # Use the passed-in timeout rather than the global SESSION_TIMEOUT
    response = session.get(route, timeout=timeout)
    response.raise_for_status()
    response_data = response.json()
    return response_data

def _set_auth_tokens(paperless_url: str, session: httpx.Client, timeout: float):
    response = session.get(paperless_url, timeout=timeout, follow_redirects=True)
    response.raise_for_status()
    csrf_token = response.cookies["csrftoken"]
    session.headers.update(
        {"Authorization": f"Token {API_AUTH_TOKEN}", "X-CSRFToken": csrf_token}
    )

#######################################################
# Main
#######################################################
if __name__ == "__main__":
    # Running inside the Docker container
    with httpx.Client() as sess:
        # Set tokens for the appropriate header auth
        _set_auth_tokens(PAPERLESS_URL, sess, SESSION_TIMEOUT)

        # Get the PK as provided via post-consume
        doc_pk = int(os.environ["DOCUMENT_ID"])

        # Query the API for the document info
        document_api_route = f"{PAPERLESS_URL}/api/documents/{doc_pk}/"
        doc_info = _get_resp_data(document_api_route, sess, SESSION_TIMEOUT)

        # Extract the currently assigned title
        doc_title = doc_info["title"]
        print(f"Post-processing input file: '{doc_title}'...")

        # Parse the file name for date_created, correspondent, and title
        extracted_date, extracted_correspondent, extracted_title = parseFileName(doc_title)

        # Clean up title formatting
        new_doc_title = extracted_title.replace("_", " ")

        # Get (or create) the correspondent ID
        correspondent_api_route = f"{PAPERLESS_URL}/api/correspondents/"
        correspondent_id = getOrCreateItemIDByName(extracted_correspondent, correspondent_api_route, sess, SESSION_TIMEOUT)

        data = {
            "title": new_doc_title,
            "correspondent": correspondent_id,
            "created_date": extracted_date
        }

        # Conditionally add a tag to the document
        doc_tags = doc_info["tags"]
        new_doc_tags = doc_tags
        if ADD_TAG:
            tags_api_route = f"{PAPERLESS_URL}/api/tags/"
            tag_id = getOrCreateItemIDByName(MODIFIED_TAG, tags_api_route, sess, SESSION_TIMEOUT)
            # Add the new tag to the list of current tags
            new_doc_tags.append(tag_id)
            # Set document tags
            data['tags'] = new_doc_tags

        # Print status
        print("Regexp matching was successful!")
        print(f"Date created : '{extracted_date}'")
        print(f"Correspondent: '{extracted_correspondent}'")
        print(f"Title        : '{new_doc_title}'")
        print(f"Tag IDs      : '{new_doc_tags}'")

        # Update the document
        resp = sess.patch(
            f"{PAPERLESS_URL}/api/documents/{doc_pk}/",
            data=data,
            timeout=SESSION_TIMEOUT,
        )
        resp.raise_for_status()