日本語 => https://gist.github.com/tateisu/3d98290f2b72d12ba5f1b977a0d5743c
SQL exapmle that deletes local copies of external posts flowing from the FTL or relays and that are not referenced long time by local users.
Note: It is "reduction" of data increase , not "to be able to operate infinitely with a fixed disk capacity".
while running Mastodon instance, toots data gradually accumulates. I understand that my instance should keep records of local user's own posts and external posts that local users are interested in. However, there is no reason to long keep records of the external posts such as "posts by external users who are no longer followed" and "posts by external users unconditionally flowing through relays". Can we delete the copy of unreferenced external posts for reducing the increase in recorded data? Especially with relay, Can we avoid to judge like "Do not use relays because it increasing posts data too much"?
- PostgreSQL 9.5 or later.
- Prepare in advance
tootctl media remove
.
Since this SQL is very slow, please make indexes in advance.
create index index_accounts_local on accounts(id) where domain is null;
You can erase it after work.
create index index_status_pins_status_id on status_pins(status_id);
You can erase it after work.
create index index_media_attachments_remote_url on media_attachments(remote_url) where remote_url is not null;
After deleting statuses, PgHero shows slow query like as SELECT "media_attachments".* FROM "media_attachments" WHERE "media_attachments"."status_id" IS NULL AND "media_attachments"."remote_url" = ? ORDER BY "media_attachments"."id" ASC LIMIT ?
.
I made index to avoid this.
I do not know what conditions this query will occur due to lack of survey, I can not be said clearly when we can delete this index.
DELETE FROM statuses WHERE id in (
SELECT id FROM statuses
where not local
and id < (EXTRACT(EPOCH FROM ((TIMESTAMP 'now')-interval '100 days')) * 65536000)::bigint
and reblog_of_id is null
and in_reply_to_id is null
and id not in ( select status_pins.status_id from status_pins where statuses.id = status_id)
and id not in ( select mentions.status_id from mentions where statuses.id = status_id)
and id not in ( select sb.in_reply_to_id from statuses as sb where statuses.id = sb.in_reply_to_id)
and id not in ( select favourites.status_id from favourites where statuses.id = status_id
and favourites.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
and id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
and sb.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
and account_id not in ( select follows.target_account_id from follows where statuses.account_id = follows.target_account_id
and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
order by id asc limit 2000
);
Since the DELETE command has an exclusive lock of the entire table, it causes other processes to stagnate. This query may be quite slow (depending on server specs), we should limit the number of deletions at once.
Also, when used in scripts etc., you can get "list of deleted status IDs" by appending returning id
at the end of delete statement.
Check the query conditions separately in the following sections.
Note: About the using of not in
instead of inner join ~ where null
or not exists
. It has impact on the execution plan of the query on PostgreSQL 9.6.x . When using not exists
, Merge Anti join ~ Materialize
is often used as an execution plan, and it runs subqueries even outside the specified range of status ID. then not exists
is rewritten with not in
, SubPlan will be processed to Cond Check inside the range scan of the statuses table instead of join. If you look at cost alone, it becomes heavy, but the rows handled in subquery is drastically reduced.
Corresponds to not local
.
Posts and boosts by local users should not be deleted.
Corresponds to id < (EXTRACT(EPOCH FROM ((TIMESTAMP 'now')-interval '100 days')) * 65536000)::bigint
.
We should delete posts only older than the number of days specified in tootctl media remove_remote
(default 7 days).
Note: Mastodon's status ID is Snowflake ID, multiplying unix time by 65536000 will replace date range specification to ID range specification.
Note: now()
is not a constant expression, but (TIMESTAMP 'now')
is constant expression.
Corresponds to id not in ( select status_pins.status_id from status_pins where statuses.id = status_id)
.
Pinned statuses are a kind of "current state" displayed in the user profile, so should not be deleted even if it is old.
Corresponds to reblog_of_id is null
.
Deleting a post that expresses a boost affects the display of "people boosting a post". Since posts expressing Boost do not flowing through Relay or FTL, there is no aggressive reason to delete.
Corresponds to :
in_reply_to_id is null
and id not in ( select mentions.status_id from mentions where statuses.id = status_id)
and id not in ( select sb.in_reply_to_id from statuses as sb where statuses.id = sb.in_reply_to_id)
Local user may be involved only in a small part of the conversation tree, in which case it is desirable for local user to be able to view the entire conversation tree. Therefore, it should not be deleted.
Note: I do not mind deleting a conversation tree where the local user is not involved at all, but this time the script does not make such a complicated judgment, all conversation tree or mentions is not deleted.
Corresponds to :
account_id not in ( select follows.target_account_id from follows where statuses.account_id = follows.target_account_id
and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
Corresponds to :
id not in ( select favourites.status_id from favourites where statuses.id = status_id
and favourites.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
Corresponds to :
id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
and sb.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
This is not included in the query example above.
If this condition is not included, so when boosted posts are deleted Posts indicating that boosting is also deleted
because db/schema.rb
hasadd_foreign_key 'statuses', 'statuses', column: 'reblog_of_id', on_delete:: cascade
.
As a result, it seems that the boost disappears in the account TL of the corresponding remote user.
Some people may oppose the "boost by followers" can not be seen. If you want to add this condition, can you write as? :
id not in ( select sb.reblog_of_id from statuses as sb where statuses.id = sb.reblog_of_id
and ( sb.account_id in ( select accounts.id from accounts where accounts.domain is null )
or sb.account_id in ( select follows.target_account_id from follows where sb.account_id = follows.target_account_id
and follows.account_id in ( select accounts.id from accounts where accounts.domain is null)
)
)
- this script does not delete posts including conversations and mentions, then the equivalent to
Statuses.unlink_from_conversations
should not be necessary. - All other foreign keys that references status id should have
on_delete:: cascade
oron_delete:: nullify
definition. - The media attachments will lost relation to status. it will be deleted by MediaCleanupScheduler that scheduled on sidekiq.
I tried running it at mastodon.juggler.jp.
It is not an exact figure because there are increased posts during the execution, but about 11% posts could be deleted.
This time it should have deleted "posts by people who is no longer followed by local users" in the time,
but if I try at half a year later, I should be able to delete also "Toot that came from relay but not touched at all by local users".