File Storage

2024-12-04

Author: @opqdonut

Involved: N.N. A.B.

Problem statement

We need to store files related to projects somewhere. The files can be large (one example was 1 GB). Some of the files need to be processed locally in the backend. An example is FOOB files, which get converted into GeoJSON using command-line tools.

Solutions

Current situation: files in postgres

Currently, we store the files in postgres as blobs.

Pros:

Simple to implement

Cons:

Probably won't scale in performance
Will need a large disk allocated for postgres
Files get received and served via our application backend: potential performance risk

A) Postgres large objects

Postgres large objects are meant for uses like this.

Pros:

Drop-in replacement for current situation

Cons:

Not familiar to development team
Files still stored in the database, so performance degradation of the db is possible
Will need a large disk allocated for postgres
Files get received and served via our application backend: potential performance risk

Questions:

How is the performance?
Backups?

B) Files on disk

Allocate a large disk for the files, keep paths to the files in the db.

Pros:

Fairly straightforward to implement
Can separate database storage from file storage (e.g. fast small disk for postgres, large slow one for files)

Cons:

Will need disk space management
Will need a separate backup strategy

C) Object storage

An object storage service like Hetzner's S3 clone is meant for WORM (write once, read many) workloads like this.

Pros:

No need for separate backup strategy
Files can be served from the object storage directly, instead of via the backend
Files can be uploaded directly to the object storage from the user's browser, instead of via the backend
Capacity is effectively unlimited, so there is no disk to size or grow
Possibility to make the backend stateless in the future
Team is familiar with this approach (used in at least N.N.'s previous project)

Cons:

Will need more code
Will need a mock implementation for local development
Will need a migration script when we want to drop support for in-database blobs

Questions:

How do we handle local processing of e.g. FOOB files?
How do the costs of Hetzner's S3 compare to disk space?
Which library to use with the S3 API?

Chosen solution

Let's go with C: Hetzner's Object Storage. It feels like the modern solution to storing files, and lets us not worry about backups & quotas. There are multiple providers of the S3 API, including local ones, so we can always change providers later if needed.

Implementation plan

Initially:

Use Hetzner's S3 via the AWS Java SDK v2
- Using the Java SDK directly was recommended by colleagues from XYZ
- The Java SDK v2 lets us pull in only the S3 client instead of the whole AWS SDK
Use MinIO to run a local S3 for development purposes
- Nice guide here
- Can also be used for integration & end-to-end tests
- Smaller tests can use a simpler fake written in Clojure
- Also evaluated LocalStack, but it doesn't have persistence (objects disappear when the container restarts)
Handle uploads via the backend, just like now
- FOOB handling can work as it does currently
- Less risk of consistency issues (e.g. file uploaded to S3 but registering it with the backend fails)
Serve files to the frontend directly from S3
- Use presigned GET URLs
Keep both code paths: storing files in postgres; storing files in S3
- Will delay the need to migrate existing files
Keep GeoJSON data in the database
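
The two code paths and the simpler test fake mentioned above could be sketched roughly as follows. This is only an illustration in plain Java with no AWS dependency: `FileStore`, `InMemoryFileStore` and `DualPathFileStore` are hypothetical names, and the real implementations would wrap the S3 client and the existing postgres blob code. It also assumes that new writes go to S3 only and that reads try S3 first and then fall back to postgres; the plan itself does not fix that order.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical storage abstraction; name and methods are assumptions, not from the ADR. */
interface FileStore {
    void put(String key, byte[] content);
    Optional<byte[]> get(String key);
}

/** In-memory fake, the kind of thing smaller tests could use instead of a real S3. */
final class InMemoryFileStore implements FileStore {
    private final Map<String, byte[]> objects = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] content) {
        objects.put(key, content.clone()); // defensive copy so callers cannot mutate stored data
    }

    @Override
    public Optional<byte[]> get(String key) {
        byte[] stored = objects.get(key);
        return stored == null ? Optional.empty() : Optional.of(stored.clone());
    }
}

/** Keeps both code paths: writes go to the primary store, reads fall back to the legacy one. */
final class DualPathFileStore implements FileStore {
    private final FileStore primary; // assumed to be backed by S3 in the real system
    private final FileStore legacy;  // assumed to be backed by postgres blobs

    DualPathFileStore(FileStore primary, FileStore legacy) {
        this.primary = primary;
        this.legacy = legacy;
    }

    @Override
    public void put(String key, byte[] content) {
        primary.put(key, content); // new files never land in the legacy store
    }

    @Override
    public Optional<byte[]> get(String key) {
        Optional<byte[]> hit = primary.get(key);
        return hit.isPresent() ? hit : legacy.get(key);
    }
}

final class FileStoreDemo {
    public static void main(String[] args) {
        FileStore legacy = new InMemoryFileStore();
        legacy.put("old-file", "stored before S3".getBytes(StandardCharsets.UTF_8));

        FileStore store = new DualPathFileStore(new InMemoryFileStore(), legacy);
        store.put("new-file", "stored after S3".getBytes(StandardCharsets.UTF_8));

        // Both files are readable through the same interface.
        System.out.println(store.get("old-file").isPresent());
        System.out.println(store.get("new-file").isPresent());
    }
}
```

With a shape like this, dropping postgres support later would amount to migrating the remaining legacy objects and replacing `DualPathFileStore` with the S3-backed store alone.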

Later:

Upload files directly from the browser to S3
- Hetzner's S3 doesn't support notifications, so the browser must tell the backend when the upload is done
- The backend will need to download the file from S3 for things like FOOB conversion
Remove support for files stored in postgres
- Migrate existing files to S3
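
Because there are no bucket notifications, the direct-upload flow needs an explicit "upload done" call from the browser, and the backend should probably verify that the object exists rather than trust that call. A minimal sketch of that bookkeeping, in plain Java: `UploadRegistry`, `ObjectLookup` and the key layout are hypothetical, and in the real system the existence check would presumably be an S3 request (for example HeadObject) and the state would live in the database rather than in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical view of the bucket: only "does this key exist?" is needed here. */
interface ObjectLookup {
    boolean exists(String key);
}

/** Tracks direct uploads from the moment a key is handed out until the upload is verified. */
final class UploadRegistry {
    enum State { PENDING, COMPLETE }

    private final Map<String, State> uploads = new HashMap<>();
    private final ObjectLookup bucket;

    UploadRegistry(ObjectLookup bucket) {
        this.bucket = bucket;
    }

    /** Step 1: the backend reserves an object key before the browser uploads anything. */
    synchronized String begin(String projectId, String filename) {
        // Assumed key layout; the random part avoids collisions between same-named files.
        String key = projectId + "/" + UUID.randomUUID() + "/" + filename;
        uploads.put(key, State.PENDING);
        return key;
    }

    /**
     * Step 3: the browser reports that the upload (step 2) finished.
     * Returns true only for a known key whose object is really in the bucket.
     */
    synchronized boolean complete(String key) {
        State state = uploads.get(key);
        if (state == null) {
            return false; // key was never handed out by begin()
        }
        if (state == State.COMPLETE) {
            return true; // repeated notifications are harmless
        }
        if (!bucket.exists(key)) {
            return false; // browser claimed success but the object is missing; stay PENDING
        }
        uploads.put(key, State.COMPLETE);
        return true;
    }

    synchronized State stateOf(String key) {
        return uploads.get(key);
    }
}
```

Only after `complete` succeeds would the backend download the object for things like FOOB conversion. Keys that stay `PENDING` for a long time are abandoned uploads and could be cleaned up periodically.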

Future possibilities:

Store GeoJSON data in S3 objects as well
