Announcement: Parallel Disk Usage (pdu) — A highly parallelized, blazing fast disk usage visualizer

Here is the link to the git repository in case you desire none of my ramblings: https://github.com/KSXGitHub/parallel-disk-usage.

About dust

For the past few months, I have been using dust to visualize the disk usage of heavy directories. It displays an intuitive bottom-up tree from heavier items to lighter ones. Every item comes with a percentage bar that allows me to compare the relative sizes of two sibling items as well as their parent. I quite like it.

I would soon discover its limits, however.

Functionality limitation

Sometimes I want to compare the sizes of two files relative to each other, so I type the following command:

dust /bin/ls /bin/docker

And this is what dust (v0.5.4) gives:

 140K ┌── ls    │██████████████████████████████████████████████████████████ │ 100%
  51M ┌── docker│██████████████████████████████████████████████████████████ │ 100%

Both ls and docker have the exact same percentage and bar length. This is not useful.

Performance limitation

Unlike the limitation above, which I discovered while using dust, I discovered this one while skimming dust's code in an attempt to contribute to it (in order to fix the limitation above). I noticed that although dust lists crossbeam-channel in its Cargo.toml file, it does not seem to be used in the codebase, at least not in any way that I recognize.

(pdu later proved to be faster than dust, so I guess my assumption that crossbeam-channel is not being used is true)

Obstacles that prevented me from contributing to dust

As I have already mentioned, I tried to contribute to dust to add the functionality I desired, but I encountered a few obstacles:

The integration tests fail when I run them on my machine.
I don't understand the way the code is structured.

Furthermore, even if I overcame the aforementioned obstacles, I would still have to wait for my pull request to be merged and a new version to be released. This is not to mention the performance limitation, which cannot be resolved without significantly refactoring the whole codebase.

The making of Parallel Disk Usage (pdu)

Naming

So I decided that I would create my own disk usage visualizer. It was named "dirt" initially. Then I realized the name has no relation to the actual functionality. Not only that, the name "dirt" was picked after "dust", which means that my tool would forever live in the shadow of dust. This fact does not sit well with my unbridled vanity and my astronomical arrogance. Besides, "dirt" sounds kinda derpy. So after some thinking I decided to go with "Parallel Disk Usage" (I would have taken "pdu" if not for the fact that the name was occupied on crates.io).

Implementation

I picture a directory tree as a nested tree (obviously!). A directory may contain files and subdirectories, which in turn contain other files and subdirectories. Disk usage can therefore be summed up from children to parent, and the subtrees can be summed independently of each other. This is such a perfect use case for rayon. The resulting disk usage data is also kept as a tree, because I need to visualize it.
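The children-to-parent summary can be sketched roughly as follows. This is not pdu's actual code: Entry, SizeTree, summarize, file and dir are hypothetical names, the tree is built in memory instead of being read from disk, and std::thread::scope stands in for rayon so that the sketch needs no external crates. It naively spawns one thread per child, which a real tool would avoid by using a thread pool such as rayon's.

```rust
use std::thread;

/// Hypothetical input: an entry with its own size and its children.
#[derive(Debug)]
struct Entry {
    name: String,
    own_size: u64,
    children: Vec<Entry>,
}

/// Hypothetical output: same shape, but every node holds its subtree total.
#[derive(Debug, PartialEq)]
struct SizeTree {
    name: String,
    total: u64,
    children: Vec<SizeTree>,
}

/// Summarize disk usage from children to parent.
fn summarize(entry: &Entry) -> SizeTree {
    // Sibling subtrees do not depend on each other, so they are summarized
    // concurrently (assumption: std scoped threads as a stand-in for rayon).
    let mut children: Vec<SizeTree> = thread::scope(|scope| {
        let handles: Vec<_> = entry
            .children
            .iter()
            .map(|child| scope.spawn(move || summarize(child)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });
    // The parent's total is its own size plus the totals of its children.
    let total = entry.own_size + children.iter().map(|child| child.total).sum::<u64>();
    // Heavier items first, so the visualizer can walk the tree in order.
    children.sort_by(|a, b| b.total.cmp(&a.total));
    SizeTree { name: entry.name.clone(), total, children }
}

/// Hypothetical helper: a leaf entry.
fn file(name: &str, size: u64) -> Entry {
    Entry { name: name.to_string(), own_size: size, children: Vec::new() }
}

/// Hypothetical helper: a directory whose own size is ignored for simplicity.
fn dir(name: &str, children: Vec<Entry>) -> Entry {
    Entry { name: name.to_string(), own_size: 0, children }
}
```

With rayon, the scoped-thread part would presumably shrink to something like entry.children.par_iter().map(summarize).collect(), which is what makes the crate such a natural fit for this shape of problem.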

The results

I finally have the functionality that I desired

❯ pdu /bin/ls /bin/docker --min-ratio=0
142K   ┌──ls    │                                                            │  0%
 54M   ├──docker│████████████████████████████████████████████████████████████│100%
 54M ┌─┴(total) │████████████████████████████████████████████████████████████│100%

(The above figure compares the minuscule size of ls to the supermassive black hole that is docker, thereby demonstrating the immoral inefficiency of Go as opposed to the immoral efficiency of C)

As you can see, the graph above is far more useful than that of dust.
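The difference comes down to what each bar is measured against: in the pdu output above, every argument is compared to a shared total instead of being its own 100%. A minimal sketch of that idea, assuming hypothetical helpers bar, percent and render (this is not pdu's actual rendering code):

```rust
/// Hypothetical helper: a right-aligned bar of `width` cells for `size` out of `total`.
fn bar(size: u64, total: u64, width: usize) -> String {
    // Round to the nearest whole cell; an empty total yields an empty bar.
    let filled = if total == 0 {
        0
    } else {
        ((size as u128 * width as u128 + total as u128 / 2) / total as u128) as usize
    };
    let filled = filled.min(width);
    format!("{}{}", " ".repeat(width - filled), "█".repeat(filled))
}

/// Hypothetical helper: share of `size` in `total`, rounded to a whole percent.
fn percent(size: u64, total: u64) -> u64 {
    if total == 0 {
        0
    } else {
        ((size as u128 * 100 + total as u128 / 2) / total as u128) as u64
    }
}

/// Hypothetical helper: one line per item, all measured against the same total.
fn render(items: &[(&str, u64)], width: usize) -> Vec<String> {
    let total: u64 = items.iter().map(|(_, size)| size).sum();
    items
        .iter()
        .map(|(name, size)| {
            format!("{name} │{}│{:>3}%", bar(*size, total, width), percent(*size, total))
        })
        .collect()
}
```

Because both items share one denominator, a tiny file next to a huge one rounds down to an empty bar rather than showing a full one.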

Bonus

❯ pdu /bin/{yay,paru} --min-ratio=0
 7M   ┌──paru │                                  ████████████████████████████│ 45%
 8M   ├──yay  │                            ██████████████████████████████████│ 55%
15M ┌─┴(total)│██████████████████████████████████████████████████████████████│100%

Moral zero-cost abstraction with generics: 1

Immoral mediocre #lolnogeneric garbage collector: 0

And it's fast

This is a benchmark sample of pdu (v0.0.0) against dust (v0.5.4), dutree (v0.12.5), and du (measured by GitHub CI after deployment):

(there's more)

As you can see, pdu easily beats both dust and dutree by a large margin. This does not surprise me since, on my machine (Arch Linux btw), pdu's debug build already beats dust by a small margin.

What surprises me, however, is that pdu also beats du by a small amount, even though pdu's release build loses to du on my machine.

And it's extensible

The pdu binary itself is not extensible. What is extensible is the crate. parallel-disk-usage is both a binary crate and a library crate. One may use the library to build one's own alternative to pdu with extra functionality.

Finally, you may go to the GitHub repository to read more about pdu. You may also sponsor me via Patreon because whilst I might be skilled in many domains such as programming, trash talking, bolstering my own vanity and arrogance, etc., making money is not one of them.

KSXGitHub/2021-05-28-announce-parallel-disk-usage-pdu.blog.md