Some notes on the SE-EDU development process

Table of Contents

1. Contribution process overview
2. Why?
- 2.1. The problem space
- 2.2. Goals of the contribution process
3. How?
4. The "single logical change per commit" rule
5. Commit messages
6. Commits from other projects

1. Contribution process overview

Create a PR
When ready, submit a new iteration via CanIHasReview.
Wait/respond to review comments.
Make changes, go back to step 2, repeat indefinitely.

… Pretty similar to most projects

What makes SE-EDU’s development process different compared to other GitHub projects you may have come across:

Clear signaling of PR status via CanIHasReview
Rebase workflow / "Commit organization" (Every commit is a single change)
Commit messages

2. Why?

2.1. The problem space

Given A PR, reviewer decides "merge" or "don’t merge".
How should a reviewer decide?
- Let’s think of some criteria:
- Correctness of the code according to its specification
  - E.g. Function gives the correct output for an input
  - Question: What is the specification?
- Does the PR fix the bug?
  - What does "fixing the bug" mean?
  - Is the bug just a sign of a much deeper problem?
- Does the PR enhance the product?
  - Is it truly an enhancement?
  - Is it aligned with the project goals?

Given just the diff of the PR (and nothing else, not even communication with the PR author), can you decide if the PR should be merged?

2.2. Goals of the contribution process

Provide as much information as possible to the reviewer so that the reviewer can make a correct "merge" or "no-merge" decision, while being able to precisely identify the reasons why the decision should be made.
While being as efficient as possible.

3. How?

The "single logical change per commit" rule
Commit messages

4. The "single logical change per commit" rule

"Each commit contains a single logical change, and this change must stand on its own."
"Each commit has a single responsibility, and the responsibility must be fully carried out.

Sounds familiar? This rule is inspired by two ideas in programming: the Single Responsibility Principle and Levels of Abstraction.

4.1. Single Responsibility Principle

(via Wikipedia)

The single responsibility principle is a computer programming principle that states that every module, class, or function should have responsibility over a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by the class. All its services should be narrowly aligned with that responsibility. Robert C. Martin expresses the principle as, "A class should have only one reason to change," although, because of confusion around the word "reason" he more recently stated "This principle is about people."

4.2. Levels of Abstraction

Abstraction: The process of hiding details to more closely attend to other details
Levels/Layers of Abstraction: Each level represents a different model of the same information, but with varying amounts of detail.

Why do we have abstraction?

Makes it easy to reason about the program, as well as to communicate with others.
Likewise, we can apply this concept to PRs as well, so as to make it easier for contributors and reviewers to reason about changes in the PR, and to communicate with each other about details in the PR.

4.3. What is a "Single Logical Change"?

Logical

The logical framework(s) used by the project. (e.g. for software code: a refactoring)

Single change

A single entity when viewer from a certain level of abstraction.

Which level of abstraction is suitable? Depends on who you ask, but do note that we only have space for two levels of abstraction — the PR level and the commit level.
Try to choose the level of abstraction that makes it easy for you and the reviewer to reason about the change!

VERY LIKELY NOT:

Changes made to a single file
A single diff hunk
Why: The level of abstraction is usually too low for reviewers to easily reason about the diff.
Also, Git already provides this information, so choosing them as the level of abstraction won’t add any value.

Strategies for detecting "single responsibilities"

Look for patterns in the changes you are making to the code. Changes matching the same pattern will likely belong to the same responsibility.
Try to give a "one liner summary" of what your changes are doing, without using "and" (or its equivalents).
Also, consider that each commit must not break the build.

4.4. Why the "single logical change" rule?

Prevent diffs from different changes from mixing together.: Makes it easier to review since reviewers won’t need to spend extra brain cycles separating the changes into their different logical components while reading the diff.
Prevent unnecessary changes from being made.: When the "single logical change" is identified and stated in the commit message, any unnecessary change in the diff will stick out like a sore thumb.
Makes discussing changes easier.: Sometimes, a single logical change could stretch across different files and diff hunks. Without separating them into different commits, we can only refer to them vaguely with prose or at the granularity of a diff hunk or file.
Easier manipulation of the PR.: Commits in the PR can be easily split into separate PRs or joined together into a single PR as needed. (Depending on project requirements.)

4.5. Comparison: A "Squash and Merge" Workflow

"Squash and Merge" workflow: Very popular with GitHub projects nowadays.
"Single logical change per PR"?
- Less flexible since we only have one level of abstraction to work with.
- Each PR requires a separate branch, too heavyweight, discourages contributors from actually separating changes.
- "My commits would all be squashed into a single one anyway, so why bother?"

For SE-EDU development, you have the freedom to rearrange/rewrite your commits. Try playing around to get a feel for how to effectively organize your commits.

4.6. Language of commits

Splitting a PR into commits, each with a clearly defined responsibility, gives you a language to reason about and communicate changes at a higher level.

… we should probably drop [3/6] because …
… perhaps [4/8] should come before [2/8] …
… compared to [v2 4/6], the implementation used in [v3 5/7] seems less clean because …

4.7. Striking a balance

Try to strike a balance between a level of abstraction that is too low and a level of abstraction that is too high.

Don’t be afraid of submitting PRs with only one or two commits.
- Most PRs will only have one commit.
- You can always split your single commits into multiple commits when it becomes clear that it should be done.

If the PR gets too big…

Why is the PR too big?
- Does the PR contain commits which are irrelevant to the overall goal?
  - Split them into another PR.
- Is the goal of the PR defined too broadly?
  - Split into smaller PRs, each with smaller scope.

BUT!

PRs that are split out must be valuable on their own.
- Not convincing: This PR #1 doesn’t really add much value when it is merged, but when PR #2 is merged everything would be so much more awesome! (If so, why not put them together?)
- You never truly know whether a PR/commit is needed in accomplishing a high level goal until you actually accomplish that high level goal.
Don’t be afraid of submitting PRs that are too "big" as well.
- The "single logical change per commit" rule must be followed, no matter how many commits it ends up producing!
- We have CanIHasReview to help us with generating interdiffs, so we can handle big PRs with ease.

Basically, don’t sweat the details, trust your intuition. As long as you follow the "single logical change per commit" rule, you are on the right track, no matter how big or small the PR ends up being.

If you want to build and refine your intuition, just practice :-)

5. Commit messages

5.1. Back to first principles — The goal of commit messages

To inform.
- What is the problem.
- What is the single responsibility of the commit. (You did decide that, right?)
- What is the intent and expected result of the change.
- Related background information.
More importantly, to persuade.
- Give the reviewer/maintainer confidence that the change is a good thing for the project.
- Don’t just describe the change, argue for the change!

5.2. Commit messages vs Issue/PR descriptions

Commit messages are not special! Any technique you already use to make your issue/PR description a good one applies to commit messages as well.

Don’t just follow the commit message template. Ask yourself (or your friends) whether the commit message you wrote is persuasive.

So, why not just use issues/PR descriptions?

To support the "one logical change per commit" rule

If the rule is followed, each commit would have a different topic, and thus a different description. Using commit messages would thus allow us to tie the description together with the commit, as opposed to squashing them all together into the issue/PR description.

To leave an audit trail

E.g. for git bisect, searching through Git history

For extra convenience

Since we have spent so much time crafting the issue/PR description, might as well bundle it with the commit itself as well.
GitHub allows us to view all commit messages at one go when browsing commits, can’t do that with issue/PR descriptions.
Supported by more tooling: IntelliJ will show the commit message when hovering over commit hashes, git bisect will print the commit message of any offending commit etc.

Commit messages are part of the SE-EDU review process to ensure they are kept up to date.

5.3. Reviewing commit messages

Don’t just look at the style, whether it follows the commit message template etc.!
Check whether arguments made are sound.
Check the understanding of contributors
- If the contributor has thorough understanding of the technicalities behind the change, it is more likely that the change is correct.
- If the contributor has also considered other alternatives, it is even more likely that the change would be beneficial for the project.
Allows reviewers to check that the interests of the contributors are aligned with the project.

Signs that the change may be a bad idea: Hand waving, imprecise language, glossing over details, factual errors, bold claims without substantiation, invalid or unsound arguments.

6. Commits from other projects

Believe it or not, SE-EDU’s contribution process isn’t really novel at all. Here are some projects which do nearly the same things that we do, and some examples of commits from their history.

6.1. Git

https://github.com/git/git/commit/ed15e58efefc7f9dc1281860289e48e8f9c9b8fa

daemon: fix length computation in newline stripping

When git-daemon gets a pktline request, we strip off any
trailing newline, replacing it with a NUL. Clients prior to
5ad312bede (in git v1.4.0) would send:

  git-upload-pack repo.git\n

and we need to strip it off to understand their request.
After 5ad312bede, we send the host attribute but no newline,
like:

  git-upload-pack repo.git\0host=example.com\0

Both of these are parsed correctly by git-daemon. But if
some client were to combine the two:

  git-upload-pack repo.git\n\0host=example.com\0

we don't parse it correctly. The problem is that we use the
"len" variable to record the position of the NUL separator,
but then decrement it when we strip the newline. So we start
with:

  git-upload-pack repo.git\n\0host=example.com\0
                             ^-- len

and end up with:

  git-upload-pack repo.git\0\0host=example.com\0
                           ^-- len

This is arguably correct, since "len" tells us the length of
the initial string, but we don't actually use it for that.
What we do use it for is finding the offset of the extended
attributes; they used to be at len+1, but are now at len+2.

We can solve that by just leaving "len" where it is. We
don't have to care about the length of the shortened string,
since we just treat it like a C string.

No version of Git ever produced such a string, but it seems
like the daemon code meant to handle this case (and it seems
like a reasonable thing for somebody to do in a 3rd-party
implementation).

Reported-by: Michael Haggerty <[email protected]>
Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

6.2. Linux

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f076ef44a44d02ed91543f820c14c2c7dff53716

rtc: rk808: Compensate for Rockchip calendar deviation on November 31st

In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar
insufficiently represented reality, and changed the rules about
calculating leap years to account for this. Similarly, in A.D. 2013
Rockchip hardware engineers found that the new Gregorian calendar still
contained flaws, and that the month of November should be counted up to
31 days instead. Unfortunately it takes a long time for calendar changes
to gain widespread adoption, and just like more than 300 years went by
before the last Protestant nation implemented Greg's proposal, we will
have to wait a while until all religions and operating system kernels
acknowledge the inherent advantages of the Rockchip system. Until then
we need to translate dates read from (and written to) Rockchip hardware
back to the Gregorian format.

This patch works by defining Jan 1st, 2016 as the arbitrary anchor date
on which Rockchip and Gregorian calendars are in sync. From that we can
translate arbitrary later dates back and forth by counting the number
of November/December transitons since the anchor date to determine the
offset between the calendars. We choose this method (rather than trying
to regularly "correct" the date stored in hardware) since it's the only
way to ensure perfect time-keeping even if the system may be shut down
for an unknown number of years. The drawback is that other software
reading the same hardware (e.g. mainboard firmware) must use the same
translation convention (including the same anchor date) to be able to
read and write correct timestamps from/to the RTC.

Signed-off-by: Julius Werner <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>

6.3. PostgreSQL

https://github.com/postgres/postgres/commit/ab5fcf2b04f9cc4ecccb1832faabadb047087d23

Fix plan created for inherited UPDATE/DELETE with all tables excluded.

In the case where inheritance_planner() finds that every table has
been excluded by constraints, it thought it could get away with
making a plan consisting of just a dummy Result node.  While certainly
there's no updating or deleting to be done, this had two user-visible
problems: the plan did not report the correct set of output columns
when a RETURNING clause was present, and if there were any
statement-level triggers that should be fired, it didn't fire them.

Hence, rather than only generating the dummy Result, we need to
stick a valid ModifyTable node on top, which requires a tad more
effort here.

It's been broken this way for as long as inheritance_planner() has
known about deleting excluded subplans at all (cf commit 635d42e9c),
so back-patch to all supported branches.

Amit Langote and Tom Lane, per a report from Petr Fedorov.

Discussion: https://postgr.es/m/[email protected]

pyokagan/SeEduDevProcessNotes.adoc