Currently, there is no cleanup process for Event Triggers, which can put heavy load on the database and degrade the performance of event triggers. Some customers have their own setup to clean up the events, but not in an optimal way.
Give the ability to "VACUUM" old events from the event_log table to contain size bloat and sustain the performance of Event Triggers. Hasura should be able to do this cleanup automatically; there can also be a few user-given configurations to tune the cleanup process.
Let's refer to the events we want to delete as dead events: events that have already been processed or archived.
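For concreteness, "dead" can be expressed as a predicate on the event log. A minimal sketch, assuming a Postgres source and flag columns along the lines of Hasura's hdb_catalog.event_log (delivered, error, archived); the exact column names are illustrative:

```sql
-- Sketch: count the dead events, i.e. events that have been processed
-- (delivered, or errored out after all retries) or archived.
-- Column names are assumptions and may differ from the actual schema.
SELECT count(*) AS dead_events
FROM hdb_catalog.event_log
WHERE delivered = true
   OR error = true
   OR archived = true;
```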
Before looking at the actual configuration parameters, let's briefly discuss the high-level tuning goals, i.e. what we want to achieve when changing the parameters:
- cleanup dead events – keep the number of dead events low, so as not to waste an unreasonable amount of disk space, to prevent index bloat, and to keep the fetch-events query fast.
- minimize cleanup impact – the cleanup should be as fast as possible and use as few resources as possible. Ideally, the customer should configure the cleanup once and then forget about it.
The auto cleanup feature can broadly be divided into two parts:
- The cleanup action
- Automating the cleanup action
The cleanup action will calculate the number of dead events present in the event_log table and then, based on the user configuration, delete the corresponding number of dead events.
- If there are a large number of dead events, don't delete them in one shot. We can delete them in batches so as not to affect the performance of the database, and we can let the user configure the maximum number of rows to be deleted in one cleanup action (see the SQL sketch after this list).
- We don't want our cleanup action to block other queries, so we can add a statement timeout at the SQL layer.
- We could choose to manually vacuum the table after deleting the dead events. This option is only available on Postgres, as SQL Server doesn't provide an API to shrink (vacuuming in Postgres = shrinking in SQL Server) a particular table. We could also just rely on Postgres autovacuum.
- Do we also drop the invocation logs along with the dead events? One thing to consider is that inserts into event_invocation_logs do not depend on the size of the table. The only place graphql-engine deals with invocation logs is when the console fetches invocation logs along with the events; after the cleanup, if the event is gone, the console will not be able to fetch its invocation logs. On the other hand, each invocation log takes up a considerable amount of space, because it contains the request and response payloads as JSON blobs.
- Be able to pause the auto cleanup process. This will help with metadata apply and migrations.
- Provide a manual API that accepts a config to perform the cleanup action.
- Delete only events that are older than a certain time period (for example: delete events that are older than 7 days).
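To make these considerations concrete, here is a rough sketch of what a single cleanup action could look like on Postgres. The batch size, retention period, statement timeout, and whether to also drop invocation logs would come from the user configuration; the table and column names used here (hdb_catalog.event_log, hdb_catalog.event_invocation_logs, delivered, error, archived, created_at, event_id) are assumptions for illustration only.

```sql
-- Sketch of one cleanup batch on Postgres. The literals below stand in for
-- user-supplied configuration values.

-- Abort the cleanup rather than hold locks and block other queries for long.
SET statement_timeout = '10s';

BEGIN;

WITH dead AS (
    -- Pick at most <batch size> dead events older than the retention period,
    -- skipping rows that are currently locked by an in-flight delivery.
    SELECT id
    FROM hdb_catalog.event_log
    WHERE (delivered = true OR error = true OR archived = true)
      AND created_at < now() - interval '7 days'
    ORDER BY created_at
    LIMIT 10000
    FOR UPDATE SKIP LOCKED
),
dead_invocations AS (
    -- Optionally drop the (large) invocation logs of those events as well.
    DELETE FROM hdb_catalog.event_invocation_logs
    WHERE event_id IN (SELECT id FROM dead)
)
DELETE FROM hdb_catalog.event_log
WHERE id IN (SELECT id FROM dead);

COMMIT;

-- Optionally reclaim space immediately instead of waiting for autovacuum
-- (Postgres only; SQL Server has no equivalent per-table shrink):
-- VACUUM hdb_catalog.event_log;
```

Running this batch in a loop until it deletes fewer rows than the limit would clear a large backlog gradually while keeping each transaction short.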
There are a couple of ways we can automate the cleanup action:

- Trigger cleanups when the share of dead events crosses a threshold

  For example: the user can say they want to trigger a cleanup if more than 20% of the events present are dead events (a sketch of such a check follows this list).

  Pros:
  - Cleanups will be done only when needed.

  Cons:
  - Hard to get a performant implementation of this.

- Do cleanups at regular intervals

  Pros:
  - Easy to implement.

  Cons:
  - Cleanups may be triggered even when they are not needed and, likewise, may not be triggered when they are really needed (this depends on the cleanup interval configuration).
  - For the customer, it will be challenging to come up with the optimal auto cleanup interval.
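For the threshold-based option, the trigger condition could be evaluated with a query along the lines of the sketch below. The 20% threshold and the column names are assumptions carried over from the earlier sketch, for illustration only.

```sql
-- Sketch: trigger a cleanup when more than 20% of the events are dead.
-- An exact count requires scanning the whole table, which is itself costly
-- on a bloated event_log; this is why a performant implementation is hard.
SELECT count(*) FILTER (WHERE delivered OR error OR archived)::float
         / greatest(count(*), 1) > 0.20 AS cleanup_needed
FROM hdb_catalog.event_log;
```

A real implementation would likely have to rely on planner statistics or sampled counts rather than an exact scan.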
Open questions