
@codingkarthik
Last active June 6, 2022 11:49
Events auto cleanup research spec

Challenge

Currently, there is no cleanup process for Event Triggers, which tends to cause heavy load on the database and degrade the performance of event triggers. Some customers have their own setup to clean up the events, but not in an optimal way.

Proposal

Give the ability to "VACUUM" old events from the event_log table, to contain size bloat as well as sustain the performance of Event Triggers. Hasura should be able to do this cleanup automatically; there can also be a few user-given configurations to tune the cleanup process.

Design

Tuning goals

Let's refer to the events we want to delete as dead events: events that have already been processed or archived.

Before looking at the actual configuration parameters, let's briefly discuss the high-level tuning goals, i.e. what we want to achieve when changing the parameters:

  • cleanup dead events – Keep the number of dead events low, so as not to waste an unreasonable amount of disk space, prevent index bloat, and keep the fetch-events query fast.

  • minimize cleanup impact – The cleanup should be as fast as possible, using as few resources as possible. Ideally, the customer should configure the cleanup once and then forget about it.

Feature discussion

The auto cleanup feature can broadly be divided into two parts:

  1. The cleanup action
  2. Automating the cleanup action

The cleanup action

The cleanup action will calculate the number of dead events present in the event_log table and then, based on the user configuration, delete a corresponding number of them.
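As a rough illustration of this action, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the Postgres event_log table. The column names (delivered, error, archived) and the exact dead-event predicate are assumptions for illustration, not necessarily Hasura's actual schema:

```python
import sqlite3

# Stand-in for the Postgres event_log table (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE event_log (
        id        INTEGER PRIMARY KEY,
        delivered INTEGER NOT NULL DEFAULT 0,
        error     INTEGER NOT NULL DEFAULT 0,
        archived  INTEGER NOT NULL DEFAULT 0
    )
""")
# Seed: two delivered, one errored, one archived (all dead) and two live events.
conn.executemany(
    "INSERT INTO event_log (delivered, error, archived) VALUES (?, ?, ?)",
    [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0), (0, 0, 0)],
)

def cleanup(conn, batch_size):
    """Delete up to batch_size dead (processed or archived) events."""
    cur = conn.execute(
        """
        DELETE FROM event_log
        WHERE id IN (
            SELECT id FROM event_log
            WHERE delivered = 1 OR error = 1 OR archived = 1
            LIMIT ?
        )
        """,
        (batch_size,),
    )
    return cur.rowcount

deleted = cleanup(conn, batch_size=10)
remaining = conn.execute("SELECT COUNT(*) FROM event_log").fetchone()[0]
print(deleted, remaining)  # 4 dead events removed, 2 live events remain
```

The `batch_size` parameter corresponds to the user-configured maximum number of rows deleted per cleanup action discussed below.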

Ideas

  1. If there are a large number of dead events, don't delete them in one shot. We can delete them in batches so as not to affect the performance of the database. We can let the user configure the maximum number of rows to be deleted in one cleanup action.

  2. We don't want our cleanup action to block other queries, so we can add a statement timeout at the SQL layer.

  3. We could choose to manually vacuum the table after deleting the dead events. This option is only available on Postgres, as SQL Server doesn't provide an API to shrink a particular table (vacuuming in Postgres ≈ shrinking in SQL Server). We could also just rely on Postgres autovacuum.

  4. Do we also drop the invocation logs along with the dead events? One thing to consider is that inserts into event_invocation_logs do not depend on the size of the table. The only place graphql-engine reads invocation logs is when the console fetches them along with the events; after the cleanup, if an event is gone, the console will not be able to fetch its invocation logs. On the other hand, each invocation log takes up a considerable amount of space, because it contains the request and response payloads as JSON blobs.

  5. Be able to pause the auto cleanup process. This will help during metadata apply and migrations.

  6. Provide a manual API that accepts config to do the cleanup action.

  7. Delete only events that are older than a certain time period (e.g. delete events older than 7 days).
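Ideas 1 and 7 above can be combined into a batched, retention-aware delete. The sketch below again uses SQLite as a stand-in, so the statement timeout of idea 2 is omitted (in Postgres it would be a `SET LOCAL statement_timeout` before the DELETE); the created_at column, the 7-day retention and the batch size are all illustrative assumptions:

```python
import sqlite3
import datetime

# Stand-in event_log with an assumed created_at column (stored as ISO text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log (id INTEGER PRIMARY KEY, delivered INTEGER, created_at TEXT)"
)
now = datetime.datetime(2022, 6, 1)
old = (now - datetime.timedelta(days=10)).isoformat()
recent = now.isoformat()
# 250 old delivered events (dead and past retention), 5 recent delivered, 3 old pending.
rows = [(1, old)] * 250 + [(1, recent)] * 5 + [(0, old)] * 3
conn.executemany("INSERT INTO event_log (delivered, created_at) VALUES (?, ?)", rows)

def cleanup_old_events(conn, now, retention_days=7, batch_size=100):
    """Delete delivered events older than the retention window, in batches."""
    cutoff = (now - datetime.timedelta(days=retention_days)).isoformat()
    total = 0
    while True:
        cur = conn.execute(
            """DELETE FROM event_log WHERE id IN (
                   SELECT id FROM event_log
                   WHERE delivered = 1 AND created_at < ?
                   LIMIT ?)""",
            (cutoff, batch_size),
        )
        if cur.rowcount == 0:
            return total  # no more dead events old enough to delete
        total += cur.rowcount

deleted = cleanup_old_events(conn, now)
print(deleted)  # 250, removed in batches of 100, 100 and 50
```

Bounding each DELETE at `batch_size` rows keeps individual transactions short, which is the point of idea 1: other queries are never blocked behind one huge delete.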

Automating the cleanup action

There are a couple of ways we can automate the cleanup action:

1. Trigger a cleanup action based on a threshold.

For example, the user can say they want to trigger a cleanup when more than 20% of the events present are dead events.
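The threshold check itself is simple; a hypothetical sketch, where the two counts would come from queries against the event_log table:

```python
def should_cleanup(dead_events: int, total_events: int, threshold: float = 0.2) -> bool:
    """Return True when the dead-event ratio exceeds the configured threshold.

    The 0.2 default mirrors the 20% example above; in practice the threshold
    would be a user-supplied configuration value.
    """
    if total_events == 0:
        return False  # empty table: nothing to clean up
    return dead_events / total_events > threshold

print(should_cleanup(300, 1000))  # True: 30% dead exceeds the 20% threshold
print(should_cleanup(100, 1000))  # False: 10% dead is below it
```

The hard part (the con noted below) is not this check but obtaining the counts cheaply: an exact COUNT over a large event_log is itself expensive, so a real implementation would likely rely on table statistics or approximate counts.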

Pros:

  • Cleanups will be done only when needed.

Cons:

  • Hard to get a performant implementation of this.

2. Do cleanups at regular intervals

The user configures the cleanup action to run every x hours, or on a cron schedule.

Pros:

  • Easy to implement

Cons:

  • Cleanups may be triggered even when they're not needed, and likewise may not be triggered when they're really needed (depending on the cleanup interval configuration).

  • It will be challenging for the customer to come up with an optimal cleanup interval.
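The interval-based option reduces to a due-time check that a background thread (or cron job) evaluates periodically; the helper names below are illustrative, not an actual Hasura API:

```python
import datetime

def next_run(last_run: datetime.datetime, interval_hours: int) -> datetime.datetime:
    """Earliest time the next cleanup is allowed to run."""
    return last_run + datetime.timedelta(hours=interval_hours)

def cleanup_due(last_run: datetime.datetime, interval_hours: int,
                now: datetime.datetime) -> bool:
    """True when the configured interval has elapsed since the last cleanup."""
    return now >= next_run(last_run, interval_hours)

last = datetime.datetime(2022, 6, 1, 0, 0)
check_time = datetime.datetime(2022, 6, 1, 13, 0)
print(cleanup_due(last, 12, check_time))  # True: 13h have passed, interval is 12h
print(cleanup_due(last, 24, check_time))  # False: 24h interval not yet elapsed
```

Keeping the due-check separate from the cleanup action itself also makes idea 5 (pausing the process) straightforward: pausing just skips the check without touching the action.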

Open questions

  1. Should the auto cleanup configuration be per source or a global configuration? (There's a metadata implication of event trigger configuration being in source vs in tables)
  2. Should the manual API be in OSS also?
