We want only public profiles to be searchable and visible in the user directory search. By public profiles ("canonical" maybe?) we mean the thing returned by GET /profile/USER_ID.
We're getting names into the user directory (user_directory and user_directory_search) via calls to update_profile_in_user_dir.
There are five calls:
main/user_directory.py(regenerate the user directory)_populate_user_directory_process_users. 1 call site, reads data from theprofilestable. Only does stuff ifuser_directory_search_all_users._populate_user_directory_process_rooms. 1 call site, reads from theroom_membershipstable
handlers/user_directory.py(incrementally update the user directory)handle_local_profile_change. 4 call sites, all reading fromprofiles_handle_profile_change. 1 call site, reads from a room event. Seems to handle both local and remote users._handle_new_user. 2 call sites, one reading from a room event, andother fromroom_memberships.
I think the idea behind these two tables is:
- combine local user profile info (
profiles) with remote user profile info (hinted at inroom_membership) into a single place. (At the cost of maintaining a second source of truth?) - factor out the text search stuff into a separate table, so we can limit the scope of postgres/sqlite differences
Any per-room "private names" will appear in room membership events, which end up getting stored in room_memberships. So per-room names can get into the user_directory table via the bolded entries above.
To fix the leak without changing the existing source code too much, we'd need to ensure that the bolded entries always refer to a "public profile". When the user in question is local to this homeserver, we can use query the profiles table. Otherwise... well, we have no alternative but to query the remote user's homeserver. We'll have to do that in an efficient way to stop ourselves from spamming the same profile request in short succession if a lot of rename events come through.
Profile update events will come through to lots of rooms. Wait until we've had $THRESHOLD seconds without a duplicate request before firing one off. (Or don't fire one off unless the last time we tried was $THRESHOLD seconds ago?
I think the code paths would be simpler if we made the config flag user_directory_search_all_users just decide which WHERE clause to use when searching the user directory. I don't think it needs to control the data we put into the user_directory table.
As a bonus, doing this would mean that a server admin can toggle this config flag live without having to restart their server.
The cost of doing so is an extra entry in the user_directory and user_directory_search table for users on this homeserver who aren't in any rooms. I think this is negligible. The search query we use will
- Remove the guards in
DeactivateAccounHandler:activate_account,ProfileHandler:set_displayname,ProfileHandler:set_avatar_url,RegistrationHandler:register_user. This means changes to local profiles will immediately propagate to theuser_directorytables. - Remove the guards in
_populate_user_directory_createtablesand_populate_user_directory_process_usersso that these will always scan for local users.
TODO: want to exclude "support users" from directories at all stages. Can a user change from being support to not support? Will support users always be local to this homeserver?
The above can then be solely responsible for adding and updating local users' profiles to user_directories. So the UserDirectoryHandle can completely ignore any per-room profiles for local users.
- Change
_populate_user_directory_process_roomsto only adds user_directory entries for remote users. Do this by skipping local users in the loop overusers_with_profile. Note that we need to still consider these users for the purposes of computingusers_in_public_roomsandusers_who_share_private_rooms.- TODO: why do we skip over remote users if the room is not public? (last
elseblock)
- Because we're only ever going to be searching
users_who_share_private_roomswhere the first user is local?
- TODO: What's an appservice? Why might we not want to add its users to
users_in_public_roomsandusers_who_share_private_rooms?
- TODO: why do we skip over remote users if the room is not public? (last
- Change
_handle_deltasto continue iftyp == EventTypes.Member,_get_key_changeindicates no membership change and the user is local
- Find a name for
_populate_user_directory_process_roomswhich hints that we'll also populate the "who's in what room" tables - Use an enum instead of
Optional[bool]for the return type of_get_key_change
Clarify the role of UserDirectoryHandler. I think it has three purposes:
- Deciding the result of a user search (
POST /user_directory/search) - Exposing hooks for other parts of the application to call when local user profiles are added, changed or removed
- Respond to room changes
- publicness change
- user added or removed
- user changed their profile within that room
I suggest
- Move
notify_new_eventdown belowhandle_user_deactivated - Rename
handle_user_deactivated -> handle_local_user_deactivated
I think this means we'll need some kind of queuing mechanism to deduplicate and send out requests across the federation.
I think we should treat this as a separate piece of work because the surgey above already feels intimidating.