lyzadanger/group-ids.md

Last active October 17, 2018 14:10

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/lyzadanger/2f63b4e3e00b0073cda113cd9ee38599.js"></script>
Save lyzadanger/2f63b4e3e00b0073cda113cd9ee38599 to your computer and use it in GitHub Desktop.

Thinking about group identifiers...

Raw

After a brief jaunt through code and a think, I think I would summarize my reluctance to try to solve the client-determined identifier need for groups using pubid as follows. This doesn’t touch on the authorization or invite bits involved with pubid, just some high-level things.

I don’t know if this is deep enough to claim that I’ve looked at how hard it would be to update pubid in all the various places, but I realize my argument stands even if group.pubid were easy to modify. I feel like it inherently does something else than the notion we’re after here. Here are my thoughts!

One of the current core things about pubids is that they are known to be unique for a resource across all of our data. This is reflected in their use to identify a single resource in 17 different routes in our application at present.
However, our need for a client-determined identifier needs to enforce uniqueness per authority — the same identifier could be reused in multiple authorities. I believe this to be a design necessity, alert me if I’m wrong.
If we changed pubid to be unique-per-authority, we’d have a bunch of routes that no longer point to unique resources, but possibly one URl that could represent multiple resources. Uh oh!
organization also makes use of a pubid. If we changed the way pubid worked for groups (vis-a-vis their uniqueness or composition), organization and group would be out of sync, which introduces some complexity—organization pubids could be relied upon to point to a single resource, but not group pubids. Uh oh!
Introducing a groupname field would be concordant with the way identifiers work for users (username), and could be brought into yet more harmony by the adoption of a groupid syntax in line with the way we compute userids

We have this system, roughly, which can be applied to various resources:

An internal unique ID, never exposed (i.e. an auto-increment)
A service-generated, web-facing, guaranteed-unique pubid. This pubid has some extra meaning WRT group right now (having possession of it gives you powers). In the future, I would suggest this should perhaps not be the case, but I’m going to put that aside for now.
A client-determined, unique-per-authority identifier (user.username,;proposed group.groupname). The client/authority owns managing these for uniqueness.
Thus: There is a clear line between pubid (owned/managed by the service) and *name (owned/managed by the authority’s client(s)).
A computed syntax that can be used to form a service-wide unique identifier by combining *name with the authority, e.g. the way we compute userid (account:<username>@<authority>). We could also have a groupid syntax.

Extending this a little into the future (fuel for ruminating, not immediate):

users don’t have pubids. Perhaps they should? Could be useful and consistent. Discuss? :)
If our resources had a consistent combination of these, resource endpoints could ostensibly accept either a pubid or a *id (userid, groupid, etc.) syntax…

Second Revision!

I’m starting to realize that there is an inherent incompatibility between all of the following requirements in play at once:

Have a client-definable, unique-to-authority identifier for a resource AND
That identifier must be mutable AND
We want an UPSERT endpoint that uses that mutable identifier as the, well, identifier for a resource

My hunch is that the first two requirements are necessary. The first is relatively self-explanatory (and how we got here in the first place) and the second follows naturally, especially as—as Rob pointed out, thank you—the username field on user is mutable, and this identifier is supposed to be client-owned—thus the client should be in control and able to mutate that property’s value.

However, the third requirement arises from a desire to limit requests made from the LMS app—UPSERT operations are not actually RESTful and trying to impose them here is starting to show some fracture lines. Erp.

My earlier proposal of an endpoint in the style of PUT /api/groups/{groupid} where groupid is a userid-like construct, e.g. ”group:[email protected]” is not sound, as the somegroupidentifier could be mutated—and even potentially in that very request. Yikes. That is not good.

(Note: I’m going to call the client-owned ID groupname and the service-owned ID pubid in the following but that doesn’t imply those names are baked; I’m also using an invention of groupid as defined above, similar to userid, but that’s provisional, of course, too).

So, what WOULD be REST-ful?

Create a new group: POST /api/groups (with a groupname in the request body); no change here;
Retrieve a group:
- GET /api/groups/{pubid} OR
- GET /api/groups?groupname=somegroupidentifier&authority=myauthority.org OR GET /api/groups?groupid=group:[email protected]
Update a group: PATCH /api/groups/{pubid}
- Totally legal to update the group’s groupname in this request

In this model, we definitely need to retain something pubid-like. We need that truly unique, system-wide identifier. It needs to be something that cannot be changed by the client.

And in this model, the pubid is the unique reference to the resource; thus the modeling of retrieving-by-groupname-or-groupid is really a search function.

And how would LMS consume this?

With respect to groups, on launch:

Attempt to retrieve the group using the client-determined identifier: GET api/[email protected]
- If 404 response, the group doesn’t exist. If current LTI user is an instructor: POST /api/groups with request body containing desired identifier (i.e. create the group)
At this point, we should have a 200 response with the group resource in it—which includes pubid
PATCH /api/groups/{pubid} with any fields you want to change in the body (or the same request body as group-create, if you want). In any case, a PATCH

Yes, I concede (sorry!): this is 2 requests related to groups on app launch (possible max of 3 if group doesn’t exist yet), but I think the API design may be more correct. It also gives the LMS app an opportunity to decide whether it wishes to perform metadata sync (i.e. updating) of resources on every launch or just sometimes/less frequently. It could also allow for a situation in which the requesting client wants to verify that the resource it retrieves is indeed the one it intended to retrieve before blindly updating it.

In Summary

I do believe we still need a distinction between a service-owned, non-mutable, unique reference (currently pubid) and a client-defined, authority-specific identifier (*name or whatever). I don’t see an easy way to retire pubids entirely, without replacing them with something else in this (unique ID) vein.
I’ve banged my head against models for an UPSERT operation for a couple of months now and I keep coming to RESTful dead ends. Sorry!
In this model, technically, groupname is not an ID so much as just-a-regular field that can be used as an identifier in some use cases. See also: username. So our IDs are still pubids. This is important to note: we wouldn’t be introducing a new ID concept here.

This proposed direction satisfies the needs to not store pubids in the LMS database and the ability for the client to define some form of unique identifier, but does not implement UPSERT.

seanh commented Oct 17, 2018 •

edited

Loading

I've organised my responses into a very long comment!!

Executive summary:

Pubids and group provider unique IDs have different and conflicting requirements
We shouldn't liken group provider unique IDs to usernames
We should liken them to user provider unique IDs
They're opaque IDs not names
Multiple simple, orthogonal, do-one-thing-and-do-it-well components are simple and easy to work with. A single do-everything-poorly component is complex and difficult to work with. This follows the existing pattern with users where we have both usernames and provider unique IDs for different purposes

I think the upshot of all this mostly matches Lyza's proposal except I'm suggesting Group.provider_unique_id instead of Group.groupname? (And I think it should be immutable and an upsert API would be better, but these may be less crucial differences.)

For clarity what I think we should do is:

Add optional, per-authority-unique Group.provider_unique_id field (new column on group table)
```
PUT /api/groups/<PROVIDER_UNIQUE_ID>
name: Hypothesis 101
```
Where in the LMS app's case <PROVIDER_UNIQUE_ID> will be a SHA-1 of a couple of LTI launch params that globally uniquely identify the LMS course.
This API only needs to be available to authclients, in fact only the LMS app, for now, and may not ever need to be a public API. (And doesn't preclude us from having a different public API for groups.)

Arguments and details and responses:

Provider IDs aren't like pubids

Lyza's gist makes this point as well but I just wanted to double make it. It's been commented that there might not be a good reason to have two separate group IDs (the existing pubids plus some kind of new, caller-provided ID) in the long-term. I've always wanted to add a new, second type of ID to groups because that's going to be quickest and easiest and I think the requirements and constraints on pubids vs the provider IDs that're required by the LMS app and other third-party integrations are quite different and incompatible. Hopefully this table gets that across:

pubids	Provider IDs
Server-generated	Client-provided
Required	Optional
Globally unique	Unique only within their authority
Short and URL-friendly	Long and provider-friendly (it's easiest if all kinds of digests and UUIDs etc are allowed)
Used in secret links, so must be un-guessable	Not used in secret links, so can be predictable (e.g. a SHA-1 of LTI launch params)
Are the same as organization pubids	Are the same as user provider IDs

Some notes on these

I think we all wish that group pubids being secret and unguessable was not the basis of privacy for private groups. But it is, and will be so for the foreseeable future. Replacing the group invite system is not seriously on the cards right now. So we do need to reckon with this.
- I also understand that you actually can't join LMS groups using their pubid / group invite page because LMS groups are third-party so the group invite page doesn't work for them. But getting group pages (and activity pages generally) working for LMS and third-party accounts is something we hear about all the time, and the LMS are groups are literally private groups (the same as first-party private groups) in the DB. I don't want to have one kind of private group (first-party ones) for which the pubid can be used for joining and another kind of private group (third-party ones) for which it can't, and place another major roadblock in our way if and when we come to making group pages work for third-party groups.
Pubids have to be server-generated because they have to be unguessable because they can be used to join the group. That's why they're generated randomly. So they can't be caller-provided. (This was the main requirement when we designed pubids -- they were not just meant as URL-friendly IDs.)
As Lyza has pointed out pubids are globally unique whereas we only want provider IDs to have to be within-authority unique because we don't want different providers to have to worry about colliding with eachother's IDs.
I also think having a short, URL-friendly, globally unique ID that is guaranteed to exist for every group is nice and potentially useful. It's nice for example that this can be used to form a nice, short unique URL (/groups/xyz123), or display a unique ID for the group, without having to include the authority in it.

We already have two different IDs for users

The idea that we should do everything with just one kind of ID, instead of having different IDs for different purposes,
doesn't follow existing architecture in h: a user in h has a username and also zero-or-more provider_unique_ids.
This was done for good reason as the requirements and constraints on usernames are quite different from those on provider
unique IDs.

Provider IDs aren't like usernames either

Lyza's gist likens caller-provided group IDs to usernames and calls them groupnames. I don't think the needed group IDs are similar to usernames:

Usernames are required, whereas provider unique IDs should be optional (non-LMS groups don't need them)
Usernames are displayed in the UI and used in URLs, so they have to be short and human readable. For provider IDs it's
actually undesirable for them to want to be short, human-readable or URL friendly, as this limits what existing IDs the
provider can use for them, and makes generating IDs harder for the provider.
Provider IDs are opaque IDs, not names.

The group IDs that the LMS app is going to create are going to be SHA-1's of the tool_consumer_instance_guid and context_id. These are IDs, they're not "names". This applies to future similar integrations too. Whenever an integration wants to take some outside object (in the LMS app's case a Canvas course) and create a group for it, it's going to be very useful if the integrator can take or generate some kind of ID for the outside object (any kind of digest, or UUID, etc) and use that as the ID for the Hypothesis group. This makes integrating a lot easier.

We don't want to make it more difficult for integrators by calling it a "name" and thereby encouraging them to think that it should be human readable and display friendly, might appear in URLs or in the UI, might want to be limited to a certain character set or be short, etc.

We want to encourage them to say "just generate an opaque ID or use any kind of existing ID you have".

We want to encourage integrators to use the kind of opaque and not human or particularly URL friendly strings that come out of standard universal ID and digest algorithms.
Groups already have names.

If we introduce a new field called groupname, then the group create and update APIs will have two separate fields in the body named name and groupname.

eLife and the LMS app are abusing usernames

A "username" is supposed to be a user (human)-chosen unique identifier that can be used in display or in URLs to publicly identify that user. eLife uses usernames that look like fs4ngbko. The LMS app uses a SHA-1 of the tool_consumer_instance_guid and user_id LTI launch params. These are opaque IDs, not usernames. They can't be used in URLs or displayed in the UI as usernames normally are in Hypothesis. The reason why both eLife and the LMS app have to stuff opaque IDs into the username field is that username is a required field in Hypothesis, and we haven't taken the refactoring time to make it optional. We're getting away with this because third-party users don't have URLs yet (user pages don't work for third-party users yet) and because non-unique display names are always enabled for third-party users.

Usernames aren't mutable

Lyza's gist (second revision part) says that usernames are mutable so groupnames -- since they're intended to be similar to usernames -- must be mutable too, and therefore a groupname-based upsert API isn't possible. The reasons why usernames can be changed (by sending a support request to us) don't apply to caller-provided opaque IDs for groups:

The desire for people to be able to change their username comes from that it's the visible-in-the-UI-and-URLs representation of them, so they might care what it looks like.
This isn't the case for the caller-provided group IDs needed for the LMS app and other third-party integrations. These are going to be SHA-1's, UUIDS, etc. It's just an opaque ID so you'd never care to change it.

Also, I don't really think usernames are mutable:

You can't change your username in your account settings
Authclients can't change the username using the user update API either (the username is in the URL, not the request body)
You can email us and we can use the admin pages to change your username but this isn't something the user or API client can do themselves
Changing a username is actually kind of a big deal. For example we need to reindex all their annotations, the URL of their user page changes, and a another person can now come along and claim the old username (and URL). Some of this data and URLs that once belong to one person but now don't work or belong to another person may be stored by others, for example in links to us, so there'll now be broken links out there or links that now go to someone other than they used to. Etc.
Enabling users to change their usernames themselves without asking us may not be something we ever want to implement, because of the potential confusion and abuse. If we do one day implement it, we may want to add it to the account settings page but not make it doable by the API.
In other systems I think it's typical to not be able to change your username at all, or to have to request a change, or changing it is available in the UI only (not API) and covered in "Are you sure?", "This cannot be undone" and "All sorts of potentially undesirable things could happen" warnings.

Changing a username would break eLife or the LMS app

If eLife of LMS users were able to change their usernames by any means this would break the app. For example the LMS app uses a SHA-1 of LTI data to compute the username and re-find the existing user account in h. If the username in h has changed since the LMS app created the account, it would fail to find it. Generally speaking IDs (user or group) that're provided by the integrator being mutable are hostile to the integrator -- if someone changes the ID the integration will break.

Group provider IDs are like user provider IDs

What the LMS app and other similar integrations need is to be able to take the unique ID of an outside object (a Canvas course in the LMS app's case), or generate one (a SHA-1 in the LMS app's case, but we want to be able to support any kind of digest or UUID within reason), and then create or update a Hypothesis group using that ID.

Hypothesis should simply treat the ID as an opaque string
It should be a unicode string of arbitrary length (with only a high maximum length restriction)
The ID should never be used in URLs
The ID should never be shown to users
The ID shouldn't be used as a secret key for joining the group, or have any other additional constraints
This should be a new Group.provider_unique_id field on models.Group
It should be optional. The LMS app and other similar integrations will make use of Group.provider_unique_id but there's no reason for other groups to have one.

The only difference between Group.provider_unique_id and provider_unique_ids for users is that users have zero-or-more provider unique IDs whereas groups have zero-or-one. This is why user provider-unique-ids are in a separate table (UserIdentity.provider_unique_id) whereas group ones just need to be a new column on the group table (Group.provider_unique_id). There's good reason for users to be able to have more than one ID -- it's to enable multiple logins / login integrations in future. But this isn't a concern for groups. (Also having more than one group provider ID might make designing update or upsert APIs for this much harder.)

Questions about REST

The ID doesn't need to be mutable, at least not from the LMS app's point of view currently (or likely ever). If mutability makes the API much harder for the LMS app and other integrators to use then it's actively undesirable from the integrators point of view. Maybe worth asking if we have an upsert API that the LMS app uses and that upsert API doesn't allow mutating the provider_unique_id, does that preclude us from also having an up_date_ API that does allow mutating it? The LMS app would have no need for such an API. But if mutability is desired is there anything preventing us from having the mutable restful API as well as the API that LMS app needs?

Does having a RESTful API preclude us from also having some other API endpoints that may not be strictly restful, in cases where they'll make the integrations we need to implement simpler and more robust?