Counting Lines of Code with GitHub Linguist and Bash

If you’re using GitHub Linguist to analyze your project’s code, you may want to take a closer look and see how many lines each individual file contains. GitHub Linguist provides a useful breakdown of the languages and files, but not an easy way to get line counts for each file. Let’s dive into how you can achieve this using a combination of GitHub Linguist, jq, and some classic Bash commands.

Why Count Lines of Code Per File?

Analyzing lines of code (LOC) per file is a useful metric to:

Assess Complexity: Identifying files with excessive lines of code might highlight areas that could benefit from refactoring.
Identify Hotspots: Knowing where most of the code resides helps focus efforts for documentation or optimization.
Track Growth: Monitoring LOC helps gauge how your project is evolving over time.

GitHub Linguist Breakdown

GitHub Linguist can give you a breakdown of your repository’s languages and the files for each language. Running:

github-linguist --breakdown

Provides an output like:

69.06%  62434      TypeScript
25.94%  23452      PLpgSQL
1.61%   1459       JavaScript
...

It also lists the filenames for each language. However, to dig deeper, we want to take these file paths and count the lines for each individual file.

The Script to Count Lines of Code

Here's the solution we’ll use to get the line count for each file listed by GitHub Linguist:

github-linguist --breakdown --json | jq -r '.[] | .files[]' | xargs wc -l 2>/dev/null | sort -n

Let’s break down what each part of this command does:

github-linguist --breakdown --json: This command generates a language breakdown in JSON format, which is easier to parse programmatically. It outputs something like this:

{
  "JavaScript": {
    "size": 1459,
    "percentage": "1.61",
    "files": [".prettierrc.js", "@app/client/eslint.config.mjs", ...]
  },
  ...
}
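Before extracting paths, it can help to confirm the shape of this JSON on your own repository. The following is a minimal sketch, assuming jq is installed and that the output matches the sample above; the function name files_per_language is made up for this illustration and is not part of Linguist.

```shell
# files_per_language: hypothetical helper, not part of GitHub Linguist.
# Reads Linguist-style breakdown JSON on stdin and prints one line per
# language in the form "<number of files><TAB><language>".
files_per_language() {
  jq -r 'to_entries[] | "\(.value.files | length)\t\(.key)"'
}

# Intended usage (assumes github-linguist is on PATH and you are in the repository root):
#   github-linguist --breakdown --json | files_per_language
```

The filter uses to_entries so that the language name (the object key) stays available next to its files array.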

jq -r '.[] | .files[]': jq is a command-line tool for parsing JSON. The filter .[] | .files[] extracts all file paths from the JSON output.
- .[]: Selects each language object (e.g., JavaScript, TypeScript).
- .files[]: Extracts each filename in the files array for each language.
- -r: Outputs raw strings without quotes, which makes the list suitable for passing to other commands.
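xargs wc -l: xargs takes the list of filenames output by jq and passes them to wc -l, which counts the lines in each file. Because xargs splits its input on whitespace by default, paths that contain spaces are broken apart and will not be counted.
2>/dev/null: This redirects error messages from xargs and wc (such as permission-denied errors or non-existent files) to /dev/null, keeping the output clean.
sort -n: Finally, sort -n sorts the results numerically by line count, so you can quickly identify which files have the most or least lines of code. When wc -l receives more than one file it also prints a summary line labelled total, which normally sorts to the end of the list.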
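The one-liner above assumes paths without spaces and leaves the total line from wc in the output. If either is a problem, the counting step can be sketched as a small function instead. This is only one possible approach; count_lines is a name chosen for this example, not something Linguist provides.

```shell
# count_lines: hypothetical helper for illustration.
# Reads newline-separated paths on stdin and prints "<lines> <path>" for
# each readable regular file, sorted ascending by line count.
# Reading one path per line keeps paths with spaces intact, and calling
# wc once per file means no "total" line is ever produced.
count_lines() {
  while IFS= read -r path; do
    # Skip anything that is missing, unreadable, or not a regular file.
    if [ ! -f "$path" ] || [ ! -r "$path" ]; then
      continue
    fi
    # $(( ... )) strips the padding some wc implementations add.
    printf '%s %s\n' "$(( $(wc -l < "$path") ))" "$path"
  done | sort -n
}

# Intended usage (assumes github-linguist and jq are installed):
#   github-linguist --breakdown --json | jq -r '.[] | .files[]' | count_lines
```

The trade-off is speed: this starts one wc process per file, which is slower than a single xargs batch on very large repositories.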

Why Use GitHub Linguist Instead of `find`?

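While it’s common to use tools like find to get a list of files for line counting, find returns everything on disk that matches a pattern. GitHub Linguist works from the files tracked in your Git repository and, by default, leaves vendored, generated, and documentation files out of its language statistics, so the list it produces is closer to the code your team actually wrote.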

If you set up your .gitattributes file properly, you can explicitly mark files as generated or not relevant for analysis:

@shared/sql/src/*.mts linguist-generated=true
@shared/sql/src/index.mts linguist-generated=false
@app/db/schema/supabase.sql linguist-generated=true

See the overrides section of the GitHub Linguist documentation for more information about this file.

This allows GitHub Linguist to automatically skip files that don’t reflect the core work done by developers, such as:

Generated files: Files that are automatically generated by tools and do not represent human effort.
External dependencies: Skipping folders like node_modules to avoid inflating your codebase statistics.

This gives a more accurate representation of the actual work that was done in the project, focusing only on meaningful contributions.
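To confirm which override a path actually receives from rules like the ones above, Git itself can report the attribute. This is a minimal sketch, assuming it is run inside the repository; linguist_generated_override is a hypothetical helper name, and it reports only explicit .gitattributes overrides, not Linguist's own detection of generated files.

```shell
# linguist_generated_override PATH: hypothetical helper for illustration.
# Prints the value .gitattributes assigns to the linguist-generated
# attribute for PATH: "true", "false", "set", "unset", or "unspecified"
# when no rule matches. The path does not need to exist on disk.
linguist_generated_override() {
  # git check-attr prints "<path>: <attribute>: <value>"; keep the value.
  git check-attr linguist-generated -- "$1" | sed 's/^.*: //'
}

# Intended usage, run from the repository root:
#   linguist_generated_override '@shared/sql/src/index.mts'
```

When several patterns match the same path, Git lets the later line in .gitattributes win, which is why the index.mts rule in the example is listed after the wildcard rule.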

A Note on Lines of Code as a Metric

It’s essential to highlight that lines of code are not an indicator of a developer’s value or productivity. Not all lines of code are equal—some lines solve complex problems, while others might simply be configuration or boilerplate code. Counting lines of code can help identify areas that need attention, but it should never be used to judge contributors.

In fact, some of the most valuable contributions come in the form of removing unnecessary code, which may not show up positively if we only consider line counts.

Example Output

Running the above command might give you output like:

   15 .prettierrc.js
   45 @app/client/eslint.config.mjs
  120 @app/client/next.config.mjs
  200 @app/client/src/app/page.tsx
  ...

This gives a clear, sorted list of each file along with its respective line count.
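On a large repository the list can be long, so it is often more practical to look only at the biggest files. A small sketch, assuming input in the format wc -l produces; largest_files is an illustrative name, not an existing tool.

```shell
# largest_files [N]: hypothetical helper for illustration.
# Reads wc -l style lines ("<count> <path>") on stdin, drops the summary
# lines labelled "total", and prints the N largest entries (default 10),
# biggest first.
# Caveat: a file whose whole path is literally "total" is dropped too.
largest_files() {
  local n="${1:-10}"
  grep -v -E '^[[:space:]]*[0-9]+ total$' | sort -rn | head -n "$n"
}

# Intended usage (assumes github-linguist and jq are installed):
#   github-linguist --breakdown --json | jq -r '.[] | .files[]' \
#     | xargs wc -l 2>/dev/null | largest_files 5
```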

Conclusion

By leveraging GitHub Linguist, jq, and simple Bash commands, you can gain deeper insights into your repository’s code at a file level while ignoring irrelevant or generated files. This helps you understand where complexity lies and provides a cleaner picture of your project's growth and structure.

However, it’s crucial to remember that lines of code are just one of many metrics. The quality, maintainability, and impact of code are far more valuable measures. Use these insights to improve your project, not as a tool to evaluate individual contributions.

With these considerations in mind, you can use GitHub Linguist as an effective tool to better understand your codebase while ensuring the metrics reflect actual work done.

amerryma/counting-lines-of-code-with-github-linguist.md