Sometimes, when people use Code Climate, they try to make changes in response to all of the issues it reports, in order to achieve straight A's and/or a 4.0 GPA. We do not recommend using Code Climate in this way. This is an attempt to articulate why.
Today, Code Climate primarily reports smells. Smells are "symptoms of your code that possibly indicate a deeper problem." They are not necessarily problems themselves. Only a human programmer can decide if a piece of code is maintainable, because it's humans who have to maintain code.
The best analogy I've heard is to use code metrics the way a doctor uses vital signs. They are not a diagnosis, but in some cases they can help make one. Other times, the most appropriate course of action is to "do nothing", even though a vital sign may be abnormal. Like a doctor: first, do no harm.
In addition to the smells themselves, Code Climate assigns an A-F rating to each class, as well as a GPA to each repo (which is simply the class ratings weighted by lines of code). We recommend interpreting the A-F ratings as follows:
- A's and B's are good.
- Treat C's as a caution flag.
- Avoid (and sometimes fix, depending on the context) D's and F's.
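To make the GPA weighting concrete, here is a hypothetical sketch of how a repo GPA could be derived from per-class ratings weighted by lines of code. The grade-point values (A=4.0 through F=0.0) and the method names are assumptions for illustration, not Code Climate's actual implementation:

```ruby
# Assumed grade-point scale for the sketch below.
GRADE_POINTS = { "A" => 4.0, "B" => 3.0, "C" => 2.0, "D" => 1.0, "F" => 0.0 }

# classes is an array of { grade:, loc: } hashes, one per class in the repo.
def repo_gpa(classes)
  total_loc = classes.sum { |c| c[:loc] }
  return 0.0 if total_loc.zero?

  # Each class's grade points are weighted by its lines of code.
  weighted = classes.sum { |c| GRADE_POINTS.fetch(c[:grade]) * c[:loc] }
  (weighted / total_loc).round(2)
end

classes = [
  { grade: "A", loc: 900 },  # many small, clean classes
  { grade: "F", loc: 100 },  # one large, smelly class
]
repo_gpa(classes) # => 3.6
```

Note how the weighting works in this sketch: a single F-rated class drags the GPA down in proportion to its size, which is why one big problem file matters more than several small ones.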
The insight here is that while individual smells are sometimes not issues that need addressing, in aggregate they are pretty good indicators. Most people feel like they'd have trouble maintaining code in classes/files that Code Climate scores as D's or F's. On the other hand, most people feel like they do not have trouble maintaining code scored as A's and B's.
For a large app that is over a year old and under active maintenance, a GPA of 3.5 or better is great.
Note: Overall Code Climate averages skew higher than that, because we host a lot of small projects (e.g. RubyGems). Smaller projects tend to be more maintainable and also have higher Code Climate GPAs.
Our main app scores a 3.2. Our "worker" scores a 3.0.
Good question. Maybe we should get rid of the B and D grades. They are primarily there because an A-F scale felt like the most understandable, and that scale includes a B between A and C.
We call this "the camel problem". As in, "the straw that broke the camel's back". Code Climate rescores the entire class every time it updates, so the size of a grade or GPA change is not connected to the size of the change made.
It is very common for bad code to accumulate through lots of small, individually justifiable changes. At some point, Code Climate throws a flag. In those cases, it is not a reflection on the particular change that was made, but an overall warning about that area of the code. We recommend taking a step back and evaluating the situation holistically in these instances.
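The camel effect above can be sketched with a toy example. Because the whole class is rescored each time, a one-line change can push an accumulated score across a grade boundary. The cost numbers and cutoffs here are entirely invented for illustration:

```ruby
# Toy grading function: the whole class's accumulated "smell cost" is
# rescored, so the grade change is not proportional to the diff size.
# The cutoffs below are made up, not Code Climate's real thresholds.
def grade(smell_cost)
  case smell_cost
  when 0...10  then "A"
  when 10...20 then "B"
  when 20...30 then "C"
  when 30...40 then "D"
  else              "F"
  end
end

grade(29) # => "C" -- many small changes accumulated to a cost of 29
grade(30) # => "D" -- one more tiny change drops the whole class a grade
```

The last straw gets blamed for the drop, but the warning is really about everything that accumulated before it.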
In cases where the algorithm can be changed to be clearly more accurate, we will do that. (Although these updates take a fair amount of time for us to roll out carefully.) An example of this is the penalization of Symbol#to_proc in Ruby. That penalty was never particularly intended, and Symbol#to_proc is now a popular Ruby idiom (one we adhere to ourselves). The penalty for it is vestigial.
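For readers unfamiliar with the idiom, the two forms below are equivalent; the shorthand is the Symbol#to_proc usage that older metric rules sometimes penalized:

```ruby
names = ["alice", "bob"]

# Explicit block form.
explicit  = names.map { |name| name.upcase }

# Symbol#to_proc shorthand: &:upcase converts the symbol into a block.
shorthand = names.map(&:upcase)

explicit == shorthand # => true
```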
Other cases are less clear. For example, Code Climate's tendency to flag long methods is too sensitive for some (and too generous for others). The core problem is that there is a tension between providing early warnings about developing maintainability issues and detecting only issues worth taking action on.
If we make Code Climate stricter, it will report more things that do not, in the judgement of humans, require action. On the other hand, if we make it less strict (for example, so it only reports BIG problems that almost certainly require action), we won't provide information until it's too late. Code Climate today can help you avoid ever introducing BIG problems because it starts to warn you early (by marking something as a C, for example).
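The strictness trade-off can be illustrated with a toy long-method check, where a single threshold controls sensitivity. The parameter name and numbers here are invented, not real Code Climate settings:

```ruby
# Toy smell detector: flag methods longer than a tunable threshold.
# "max_lines" is a made-up knob used to illustrate strict vs. lenient.
def long_method_warnings(method_lengths, max_lines: 25)
  method_lengths.select { |_name, lines| lines > max_lines }.keys
end

lengths = {
  "find_user"       => 8,
  "process_order"   => 30,
  "generate_report" => 60,
}

long_method_warnings(lengths)                # strict: early warnings, more noise
long_method_warnings(lengths, max_lines: 50) # lenient: only the worst offender
```

A strict threshold surfaces `process_order` while it is still fixable; a lenient one stays quiet until a method has already grown to `generate_report` size.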
The current system is a reflection of the balance between these priorities.
Good question. We may end up doing this. However, enabling someone to manipulate their Code Climate scores is both complex and risky.
For example, one of our most important use cases is within a team. In those contexts, you have a mix of experience levels. If one programmer were to mark an issue as "wontfix" (and the GPA went up as a result), that issue would be hidden from the rest of the team. This would impair the ability of others on the team to use Code Climate to review the code in question (because the report would have been changed to show all A's).
Also, a newly hired developer would not be able to explore Code Climate as easily to learn about the app.
Note: Interestingly, issues reported by our Rails Security Monitor are treated differently. For security issues, there is generally a deterministic answer, one that teams will agree on, as to whether an issue is real. So in those cases, we do provide a way for someone to mark something as a false positive if it is not a real vulnerability.
I really like the Q&A sections. They provide a lot of great information. However, I'm not sure how they relate to the original premise of why you shouldn't aim to have a perfect 4.0 GPA.
In that vein, I think you could speak more about how code metrics of all kinds (coverage is a common example) are not intended to be made perfect. Doing so is a waste of time with no business value, and it may actually lead to less maintainable or more brittle code.
You could also have more of a section about what to do with the D/F graded stuff. Those are the places to focus. Start from the bottom, with the biggest problems. More importantly, if you have to work in a D/F file/class on a feature, that is the time to work to improve it.
As for the Q&A, you could maybe make that into a whole distinct article. :-)