I'm trying to figure out how to use data to drive contributor ladder nomination / promotion (and possibly pruning).
Someone asked me if they could get promoted in an OWNERS file. I wanted to know: had they reviewed enough PRs relevant to that OWNERS file?
I can (sort of) now answer whether spiffxp should be in cluster/OWNERS because they're saying /lgtm on relevant PRs, e.g.:
./last-100-merged-prs.py spiffxp --repo kubernetes/kubernetes --file-regex ^cluster/ --comment /lgtm
# ...
2019-02-28T16:41:41Z: https://github.com/kubernetes/kubernetes/pull/74731 - ignored: neither authored nor commented /lgtm
2019-03-02T04:34:54Z: https://github.com/kubernetes/kubernetes/pull/74808 - commented
2019-03-02T20:59:12Z: https://github.com/kubernetes/kubernetes/pull/74851 - commented
2019-03-05T17:50:13Z: https://github.com/kubernetes/kubernetes/pull/74854 - commented
2019-04-18T05:58:20Z: https://github.com/kubernetes/kubernetes/pull/76711 - ignored: neither authored nor commented /lgtm
2019-06-14T14:58:39Z: https://github.com/kubernetes/kubernetes/pull/78614 - commented
2019-06-20T13:54:50Z: https://github.com/kubernetes/kubernetes/pull/75638 - commented
2019-06-28T00:43:47Z: https://github.com/kubernetes/kubernetes/pull/79410 - commented
2019-06-28T01:53:47Z: https://github.com/kubernetes/kubernetes/pull/79407 - ignored: neither authored nor commented /lgtm
2019-06-28T19:43:33Z: https://github.com/kubernetes/kubernetes/pull/79390 - ignored: neither authored nor commented /lgtm
2019-07-02T22:41:12Z: https://github.com/kubernetes/kubernetes/pull/79284 - commented
2019-07-11T04:39:46Z: https://github.com/kubernetes/kubernetes/pull/79949 - commented
2019-07-12T05:03:05Z: https://github.com/kubernetes/kubernetes/pull/79554 - ignored: neither authored nor commented /lgtm
2019-07-12T05:03:18Z: https://github.com/kubernetes/kubernetes/pull/80046 - ignored: neither authored nor commented /lgtm
2019-07-13T22:37:04Z: https://github.com/kubernetes/kubernetes/pull/80054 - ignored: neither authored nor commented /lgtm
2019-07-14T16:55:05Z: https://github.com/kubernetes/kubernetes/pull/80141 - commented
2019-08-01T03:09:04Z: https://github.com/kubernetes/kubernetes/pull/80796 - commented
out of 100 most recently merged PRs in 'kubernetes/kubernetes' involving 'spiffxp', 17 touched files matching regex '^cluster/'
of those, spiffxp authored 0, and commented '/lgtm' on 10
This approach:
- asks GitHub for the 100 most recently merged PRs that involve the person (authored, commented, mentioned)
- gets the list of files each PR touches, as well as the comments, reviews, review_comments and events for the PR
- determines whether the PR touches relevant files
- determines whether the person authored the PR
- determines whether the person has said '/lgtm' in comments, reviews, or review_comments
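A minimal sketch of the per-PR classification those steps describe. It is not the script itself: the `pr` dict shape, the function name, and the rule that a command only counts when it sits alone on a line are all my assumptions for illustration, and the data is expected to have been fetched already.

```python
import re


def classify_pr(pr, user, file_regex, comment="/lgtm"):
    """Classify one already-fetched PR for `user`.

    `pr` is a hypothetical dict, not a GitHub API payload:
      {"author": str, "files": [str],
       "comments": [{"user": str, "body": str}]}
    where "comments" merges issue comments, reviews and review_comments.
    """
    pattern = re.compile(file_regex)
    # Step 1: does the PR touch any relevant file?
    if not any(pattern.search(path) for path in pr["files"]):
        return "irrelevant"
    # Step 2: did the person author it?
    if pr["author"] == user:
        return "authored"
    # Step 3: did the person say the command? Assumption: the command
    # must be the only thing on its line.
    for c in pr["comments"]:
        if c["user"] != user:
            continue
        if any(line.strip() == comment for line in c["body"].splitlines()):
            return "commented"
    return "ignored"
```

Tallying the return values over the fetched PRs would give the kind of summary line shown in the output above.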
Problems:
- the list of the 100 most recent PRs involving the person is noisy, and may drown out earlier relevant activity
- someone who is mentioned often on PRs touching irrelevant files will appear more inactive than they should
- people who review a lot of PRs touching disparate paths will appear less recently active on any one path
- this approach doesn't bother to evaluate quality of review, and so could fall prey to rubber-stampers
- people who actually review would have non-zero review_comment counts, drop /holds, etc.
- the approach of looking for /lgtm or /approve ignores those who use GitHub reviews to drive this: looking for a comment of '' works, but ignores the review state of GitHub reviews (Comment/Request Changes/Approve)
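One way to cover the last problem is to read the review state alongside the comment commands. This is a sketch under assumptions: the flattened `comments` / `reviews` dict shapes and the function name are hypothetical (in the REST API the login is nested under `user`), while `APPROVED`, `CHANGES_REQUESTED` and `COMMENTED` are the state values GitHub reports on review objects.

```python
COMMANDS = ("/lgtm", "/approve")
# Review states treated as a real verdict; COMMENTED is skipped on purpose.
VERDICT_STATES = ("APPROVED", "CHANGES_REQUESTED")


def review_signals(user, comments, reviews):
    """Collect comment commands and review verdicts left by `user`.

    comments: [{"user": str, "body": str}]   (hypothetical, flattened)
    reviews:  [{"user": str, "state": str}]  (hypothetical, flattened)
    """
    signals = []
    for c in comments:
        if c["user"] != user:
            continue
        for line in c["body"].splitlines():
            if line.strip() in COMMANDS:
                signals.append(("comment", line.strip()))
    for r in reviews:
        if r["user"] == user and r["state"] in VERDICT_STATES:
            signals.append(("review", r["state"]))
    return signals
```

Someone who only ever clicks "Approve" in the review UI then shows up with `("review", "APPROVED")` entries instead of looking inactive.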
Ideas:
- instead of asking "does this person review the relevant files?", what about "who has reviewed the relevant files?"
- on the other end of things: "what files has this person actually reviewed?"
- this is focused on a single repo
- do I care about whether this person is reviewing when requested, when assigned, etc.?
- I may want to try the GraphQL API, so I can get files and comments directly
- I may want to walk back a list of PRs relevant to this dir
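The first two ideas are the same tally read in opposite directions. A rough sketch, assuming the PRs for a directory have already been collected into hypothetical dicts of the form `{"files": [...], "comments": [{"user", "body"}]}`; the function names are mine, and "reviewed" is approximated by a bare `/lgtm` line, with all the caveats listed under Problems.

```python
import re
from collections import Counter


def _lgtm_users(pr, comment="/lgtm"):
    """Set of users with `comment` alone on a line somewhere in the PR."""
    return {
        c["user"]
        for c in pr["comments"]
        if any(line.strip() == comment for line in c["body"].splitlines())
    }


def reviewers_for_path(prs, file_regex):
    """Who has reviewed the relevant files? Counts PRs per reviewer."""
    pattern = re.compile(file_regex)
    counts = Counter()
    for pr in prs:
        if any(pattern.search(path) for path in pr["files"]):
            counts.update(_lgtm_users(pr))
    return counts


def dirs_reviewed_by(prs, user):
    """What has this person reviewed? Counts PRs per top-level directory."""
    counts = Counter()
    for pr in prs:
        if user in _lgtm_users(pr):
            counts.update({path.split("/", 1)[0] for path in pr["files"]})
    return counts
```

`reviewers_for_path(prs, r"^cluster/").most_common()` would give a ranked candidate list for an OWNERS file, rather than a yes/no answer about one person.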
Scaling This Out:
- what prior art exists that we could build on?
- devstats
- could we derive the files changed from commits
- could we add a pr<->files mapping table to help construct queries?
- gharchive
- GitHub doesn't expose PullRequestReview events (ref: igrigorik/gharchive.org#197)
- ghtorrent
- microsoft/ghcrawler
- all of our prow logs
- devstats
- APPARENTLY GitHub's event streams (at least those consumed by gharchive and https://developer.github.com/v3/activity/events/#list-events-performed-by-a-user) do not contain the PullRequestReviewEvent type of event, which is the event that happens when people use GitHub's "approve a pull request" UI
Use microsoft/ghcrawler?
- Most of the stores are either mongo or azure-specific
- For grins I've pointed this at kubernetes/enhancements with two tokens, dumping into gs://spiffxp-ghcrawler, but it's not exactly easy to query
- My read of https://github.com/microsoft/ghcrawler/blob/develop/lib/visitorMap.js#L197-L210 is that this doesn't actually scrape files directly though
- May need to get at files through some level of indirection via commits
- At a glance this scrapes way more than I'm interested in, I think: stargazers, commits, individual comments
- microsoft/ghcrawler#112 says I should be able to define a scenario and visitor map to fetch only what I want (and maybe other stuff I also want?)
- https://github.com/fhoffa/analyzing_github/
- points out two other bigquery datasets besides gharchive
- http://ghtorrent.org/gcloud.html
- https://medium.com/google-cloud/github-on-bigquery-analyze-all-the-code-b3576fd2b150
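Whichever store ends up holding the data, the pr<->files mapping table floated under devstats is what makes the "who reviews this directory" query cheap. A toy sketch using in-memory SQLite: the table names, columns, PR numbers and reviewers are all made up for illustration and do not reflect the devstats, gharchive or ghtorrent schemas.

```python
import sqlite3

# Hypothetical schema: one row per (PR, file) and one per (PR, /lgtm author).
SCHEMA = """
CREATE TABLE pr_files (pr INTEGER, path TEXT);
CREATE TABLE pr_lgtms (pr INTEGER, reviewer TEXT);
"""


def lgtm_counts(conn, path_prefix):
    """Distinct PRs each reviewer /lgtm'd that touch files under path_prefix."""
    query = """
        SELECT l.reviewer, COUNT(DISTINCT l.pr) AS n
        FROM pr_lgtms AS l
        JOIN pr_files AS f ON f.pr = l.pr
        WHERE f.path LIKE ? || '%'
        GROUP BY l.reviewer
        ORDER BY n DESC, l.reviewer
    """
    return conn.execute(query, (path_prefix,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows, not real PRs.
conn.executemany(
    "INSERT INTO pr_files VALUES (?, ?)",
    [(1, "cluster/a.sh"), (1, "cluster/b.sh"), (2, "pkg/c.go"), (3, "cluster/d.sh")],
)
conn.executemany(
    "INSERT INTO pr_lgtms VALUES (?, ?)",
    [(1, "alice"), (2, "alice"), (3, "alice"), (3, "bob")],
)
```

`COUNT(DISTINCT l.pr)` matters here: a PR touching several files under the prefix joins to several rows and would otherwise be counted more than once.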