@jrafanie
Created May 22, 2016 15:15
static analysis idea: where exactly was complexity added in the last sprint/release
I believe that code duplication, while a useful metric, is often a consequence of
code complexity. In other words, complicated code is more likely to be duplicated.
That doesn't mean simple code can't be duplicated, just that it's less likely and
less of a problem. Following this line of thinking, I'm asserting that if you limit
spikes in added complexity, you will slowly remove duplication along with it.
Therefore, it might be less useful to target duplication directly, except as a way
to measure the impact of decreased complexity on duplication.
With a large legacy codebase, it's nearly impossible to remove all complexity.
One, it takes many people many hours/days/weeks/months to do it. Two, you'd have
to stop everything else you're doing, because it's hard to make large changes all
over the place while also trying to ship features. Three, legacy codebases tend
to have areas of churn: code that is frequently changed alongside code that is
infrequently or never changed. Changing code that is rarely modified is wasted
effort, because people aren't spending much time dealing with the complexity of
those files, classes, and modules.
With this in mind, we can build tooling for ruby/javascript/etc. git projects to
improve the visibility of this problem.
First, you need to calculate the time period you care about: perhaps the range
from the prior release to the current release, or maybe sprint boundaries:
commit_range = release4_git_sha1...release5_git_sha1
We use the last release, or last N releases, because we want to eliminate the
noise of code we haven't been changing.
Take this commit_range and divide it up into batches.
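A rough sketch of those two steps, assuming the script runs inside the git
repository being analyzed (the release names and the batch size of 100 are
placeholders, not part of the original idea):

commit_range = "release4_git_sha1...release5_git_sha1" # the range from above
commits = `git rev-list --reverse #{commit_range}`.split("\n")

BATCH_SIZE = 100 # arbitrary; pick whatever keeps memory and runtime reasonable
batches = commits.each_slice(BATCH_SIZE).to_a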
Process each batch of commits:
batches.each do |batch|
  batch.commits.each do |commit|
    commit.files_changed.each do |file|
      calculate_net_lines_added_removed(file)
      increment_commit_count(file)
      # Run flog/complexity before the change: git show sha1~:/path_to_file
      # Run flog/complexity after the change:  git show sha1:/path_to_file
      calculate_net_complexity_added_removed(file)
      file.methods_changed.each do |method|
        increment_commit_count(method)
        calculate_net_complexity_added_removed(method)
      end
      file.class_modules_changed.each do |class_module|
        increment_commit_count(class_module)
        calculate_net_complexity_added_removed(class_module)
      end
      # Track the author so managers/leads can see who needs help with
      # refactoring, pairing, etc.
    end
  end
end
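The helper methods above are left undefined. As one possible shape for
calculate_net_complexity_added_removed at the file level, here is a minimal sketch
that shells out to the flog CLI on the before/after blob contents from git show;
the temp-file handling, the output parsing, and the (sha1, path) signature are
assumptions for illustration, not the gist's definition:

require "tempfile"

# Flog a file as it existed at a given revision and return its total score.
# Assumes the flog gem is installed and its CLI prints a "NN.N: flog total" line.
def flog_score(sha1, path)
  contents = `git show #{sha1}:#{path}`
  return 0.0 if contents.empty? # file doesn't exist at that revision (added/deleted)

  Tempfile.create(["flog", ".rb"]) do |tmp|
    tmp.write(contents)
    tmp.flush
    `flog #{tmp.path}`[/([\d.]+): flog total/, 1].to_f
  end
end

# Net complexity a commit added to (or removed from) one file.
def calculate_net_complexity_added_removed(sha1, path)
  flog_score(sha1, path) - flog_score("#{sha1}~", path)
end

Per-method and per-class/module numbers could come from parsing the individual
lines of the same flog output, which is glossed over here.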
Note: you'll have to deal with negative complexity and line counts for people like
me who prefer to remove lines of code.
Compile all the batch results (a small sketch of this step follows the list below). Then:
Graph net complexity added per file, method, class/module, and author.
Graph net complexity added per line added, per file, method, class/module, and author.
Graph commit counts vs. complexity added per file, method, and class/module.
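As a sketch of the compile step, assuming each batch produced a hash of per-file
stats (the structure of batch_results and the CSV output are assumptions; feed the
CSV to whatever graphing tool you prefer):

require "csv"

# Merge the per-batch hashes of per-file stats into one set of totals.
# batch_results is assumed to look like:
#   [{ "app/models/vm.rb" => { net_complexity: 12.3, net_lines: 40, commits: 5 }, ... }, ...]
def compile_results(batch_results)
  batch_results.each_with_object(Hash.new { |h, k| h[k] = Hash.new(0) }) do |batch, totals|
    batch.each do |file, stats|
      stats.each { |metric, value| totals[file][metric] += value }
    end
  end
end

# Dump the totals so the per-file/per-author graphs can be built elsewhere.
def write_report(totals, path = "complexity_report.csv")
  CSV.open(path, "w") do |csv|
    csv << %w[file net_complexity net_lines commits complexity_per_line_added]
    totals.each do |file, stats|
      per_line = stats[:net_lines].zero? ? 0 : stats[:net_complexity] / stats[:net_lines].to_f
      csv << [file, stats[:net_complexity], stats[:net_lines], stats[:commits], per_line]
    end
  end
end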
Run this at each sprint, release, etc. boundary and try to find patterns. Learn
from the results; teach and mentor where you have spikes of increased complexity.
Based on these results, you may now have new ways to measure pull requests and
detect these problems before they're merged, as sketched below.
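For the pull request case, one possibility (the threshold and the warning behavior
are assumptions, not part of the gist) is to run the same per-file calculation over
the PR's commit range, reusing the flog_score sketch above, and warn when the net
complexity added crosses a limit:

MAX_NET_COMPLEXITY = 50.0 # arbitrary; tune per team and codebase

# Warn on a pull request whose net flog complexity increase exceeds the threshold.
def check_pull_request(base_sha, head_sha)
  changed = `git diff --name-only #{base_sha}...#{head_sha}`.split("\n").grep(/\.rb\z/)
  net = changed.sum { |path| flog_score(head_sha, path) - flog_score(base_sha, path) }

  if net > MAX_NET_COMPLEXITY
    warn "This PR adds #{net.round(1)} flog points of complexity; consider refactoring before merging."
  end
  net
end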