@rrbutani
Last active November 1, 2025 07:22

what

tests that we can "project" particular files out of a TreeArtifact for consumption in downstream rules

the intent is that by doing this projection:

  • downstream rules that operate on files (not directories — i.e. not TreeArtifact aware) can consume our artifact
  • sensitivity in downstream targets is narrowed to only the files in the TreeArtifact that are projected out
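
A minimal sketch of the projection idea, in plain Python rather than Starlark so it can run outside Bazel. It mirrors the `"../" * out.path.count("/")` trick used by the `project` rule further down; the `bin/dir` layout and the helper names (`relative_target`, `project`, `demo`) are made up for illustration and are not part of the gist.

```python
import os
import tempfile


def relative_target(link_path: str, file_path: str) -> str:
    """Prefix `file_path` with one "../" per directory component of `link_path`.

    Both paths are assumed to be relative to the same root (the execroot in
    the Bazel case), so climbing out of the link's directory lands at that root.
    """
    return "../" * link_path.count("/") + file_path


def project(root: str, tree_dir: str, rel_path: str, link_path: str) -> str:
    """Create `link_path` under `root` as a relative symlink into `tree_dir`."""
    target = relative_target(link_path, f"{tree_dir}/{rel_path}")
    full_link = os.path.join(root, link_path)
    os.makedirs(os.path.dirname(full_link), exist_ok=True)
    os.symlink(target, full_link)
    return target


def demo() -> tuple[str, str]:
    """Project one file out of a stand-in tree and read it via the link."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout standing in for a TreeArtifact named `dir`.
        num = os.path.join(root, "bin/dir/out/foo/bar/num")
        os.makedirs(os.path.dirname(num))
        with open(num, "w") as f:
            f.write("42\n")
        target = project(root, "bin/dir", "out/foo/bar/num", "bin/num_from_dir")
        with open(os.path.join(root, "bin/num_from_dir")) as f:
            return target, f.read()
```

A consumer that only opens `bin/num_from_dir` sees a regular file and never has to know it lives inside a directory artifact.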

Note

this example also tests out interactions w/path mapping

how is this different than ...

rules_directory/skylib's directory rules:

bazel-lib's directory_path:

  • directory_path does ingest TreeArtifacts but produces DirectoryPathInfo — a tuple of a TreeArtifact and a relative path within it
  • consumers (i.e. downstream rules) must be "aware" of DirectoryPathInfo in order to handle directory_path inputs correctly
  • to expose a DirectoryPathInfo as a file, bazel-lib has copy_file
    • this is more or less what the rules in this example do with one caveat: allow_symlink is not allowed with DirectoryPathInfo (ctx.files.src is empty)
    • this means that you have to fall back to creating an actual copy...

why is this interesting?

this scheme (narrowing down a TreeArtifact to specific files cheaply via symlinks), coupled with ECO (early cutoff optimization), allows you to get incrementality in your graph even in the face of monolithic actions that you cannot break up further

as an example:

  • say that you have some monolithic code generator that spits out a bunch of source and header files into a directory
    • say that splitting up this code generator is intractable
    • also, let's say that this code generator produces an indeterminate number of header files: deterministic, but dependent on the inputs in a way that's hard to model in analysis (not just a function of the names/number of input files, etc.)
  • say that we do know the names of the source files
  • we would like to model compilation of the source files such that we get incrementality — if only one of the source files (and none of the header files) changes, we'd like to only recompile that file
    • note that the code generator's action is still being rerun
    • narrowing what the downstream compile actions see as input allows ECO to keep us from having to rerun compile actions corresponding to source files that did not change
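
A toy model of the scenario above, in plain Python. It assumes a cache keyed on each projected source file's content digest, which is roughly what early cutoff gives you once every compile action sees only its own file. `CompileCache`, `build`, and the file names are hypothetical, and header dependencies are left out to keep the sketch short.

```python
import hashlib


class CompileCache:
    """Toy action cache keyed on (source name, digest of the source content)."""

    def __init__(self) -> None:
        self._objects: dict[tuple[str, str], str] = {}
        self.compiles_run = 0

    def compile(self, name: str, content: bytes) -> str:
        key = (name, hashlib.sha256(content).hexdigest())
        if key not in self._objects:
            # Stand-in for an expensive compile action.
            self.compiles_run += 1
            self._objects[key] = "obj-" + key[1][:8]
        return self._objects[key]


def build(generated: dict[str, bytes], sources: list[str], cache: CompileCache) -> list[str]:
    """Project each known source out of the generator output and compile it."""
    return [cache.compile(name, generated[name]) for name in sources]


def demo() -> list[int]:
    cache = CompileCache()
    sources = ["a.c", "b.c"]
    generated = {"a.c": b"int a;", "b.c": b"int b;", "gen.h": b"// header"}
    counts = []

    build(generated, sources, cache)        # cold build: both files compile
    counts.append(cache.compiles_run)

    build(dict(generated), sources, cache)  # generator reran, same outputs
    counts.append(cache.compiles_run)

    generated["a.c"] = b"int a = 1;"        # only one source changed
    build(generated, sources, cache)
    counts.append(cache.compiles_run)
    return counts
```

If the compile step were instead keyed on the whole directory, any change in the generator's output would invalidate every compile; keying on the projected file is what keeps `b.c` cached in the last step.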

this use case is (imo) not that unrealistic...

re: do we even need to do this narrowing manually?

  • for starlark rules: definitely; we do not (yet? 🤞) have access to buck2 style dynamic actions
  • for "native" Bazel rules: still yes; afaik even the builtin rules do not support TreeArtifact sources (java, for example)

If you know the contents of the TreeArtifact a priori such that you're able to project them like this, why model the action w/TreeArtifact at all?

couple of reasons:

  • sometimes you know some of the eventual outputs but not all of them... having the TreeArtifact is useful for capturing the "other" outputs
  • for this scheme you actually don't exactly have to know the outputs of interest all the way up in analysis!
    • to expose things in the TreeArtifact as regular files you certainly need to know the number of files of interest but exact file names can be something you figure out dynamically
    • DirectoryExpander + map_each + allow_closure are a powerful escape hatch; they let you do fairly non-trivial things in that "after analysis but right before my action is executed" grey area
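
A sketch of the shape of such a `map_each` callback, simulated in plain Python. `FakeFile` and `FakeDirectoryExpander` are stand-ins invented here (the real `DirectoryExpander.expand` takes a Bazel `File`, not a string), and the `bazel-out/k8/bin` paths are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeFile:
    """Minimal stand-in for a Bazel `File` inside a TreeArtifact."""
    path: str
    tree_relative_path: str


class FakeDirectoryExpander:
    """Stand-in for `DirectoryExpander`: maps a directory to its files."""

    def __init__(self, contents: dict[str, list[FakeFile]]) -> None:
        self._contents = contents

    def expand(self, directory: str) -> list[FakeFile]:
        return self._contents[directory]


def make_link_args(rel_path: str, out_path: str):
    """Build a callback that closes over `rel_path` and `out_path`."""

    def link_args(directory: str, expander: FakeDirectoryExpander) -> list[str]:
        # Runs "late": the directory's contents are only known at this point.
        for f in expander.expand(directory):
            if f.tree_relative_path == rel_path:
                return ["../" * out_path.count("/") + f.path, out_path]
        raise LookupError(f"{rel_path} not found in {directory}")

    return link_args


def demo() -> list[str]:
    tree = "bazel-out/k8/bin/dir"
    expander = FakeDirectoryExpander({
        tree: [
            FakeFile(tree + "/out/mod", "out/mod"),
            FakeFile(tree + "/out/foo/bar/num", "out/foo/bar/num"),
        ],
    })
    callback = make_link_args("out/foo/bar/num", "bazel-out/k8/bin/num_from_dir")
    return callback(tree, expander)
```

The match here is on an exact relative path, but the predicate inside the loop could be anything computable from the expanded file list, which is what makes this usable when only the number of files of interest is known up front.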

misc

Also see (previously): "Bazel dynamic input subsetting with TreeArtifacts"

This gist is essentially using the same idea as ^ (narrow a TreeArtifact using symlinks for better incrementality) but with a couple of important differences:

  • in ^, we're going from a known set of files to a TreeArtifact (subset that's not known until execution time)
  • in this gist, we're going from a TreeArtifact to a subset of files (that's ~known [1] during analysis)
  • in ^ we were doing the TreeArtifact business within the confines of one rule
  • in this gist we're explicitly trying to expose the symlinks to other rules such that they can be used without having to be aware of our directory/directory narrowing scheme (i.e. unlike bazel-skylib's DirectoryInfo)

Caution

the materialized symlink in bazel-out is different for local execution vs RBE... (uh-oh?)

for things not in the bazel execroot it's the same when materialized...

but for things that are in the execroot, running locally materializes as a relative symlink, running remotely materializes as an abs path symlink to the file

EDIT: never mind! this is just BWoB (Build without the Bytes) at play; with --remote_download_toplevel, heuristics are used to uncover that the artifact is a symlink to another artifact (and that it thus needs to materialize the underlying artifact); this alternate materialization codepath is responsible for the discrepancy, I think

this should probably still be fixed though... haven't verified for sure but I think this discrepancy might be able to influence downstream actions?

  • yeah, the contents in the execroot actually are different...
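
To make the "contents are different" point concrete, here is a small plain-Python sketch (not something the gist itself runs). It creates a relative and an absolute symlink to the same file: reading through either gives identical bytes, but the raw link text differs, and that is what an action could observe if it inspects the symlink rather than following it. The `dir/num` layout and the `link_views` helper are hypothetical.

```python
import os
import tempfile


def link_views(root: str) -> dict[str, tuple[str, str]]:
    """Create a relative and an absolute symlink to the same file.

    Returns, per link, (raw link text from readlink, content read through it).
    """
    target = os.path.join(root, "dir", "num")
    os.makedirs(os.path.dirname(target))
    with open(target, "w") as f:
        f.write("7\n")

    rel_link = os.path.join(root, "rel_link")
    abs_link = os.path.join(root, "abs_link")
    os.symlink(os.path.join("dir", "num"), rel_link)  # like local execution
    os.symlink(target, abs_link)                      # like the BWoB path

    views = {}
    for name, link in (("relative", rel_link), ("absolute", abs_link)):
        with open(link) as f:
            views[name] = (os.readlink(link), f.read())
    return views
```

Tools that call `readlink` (or archive symlinks as symlinks) would produce different results for the two materializations even though the file content is the same.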

Footnotes

  1. DirectoryExpander escape hatch applies

common --experimental_output_paths=strip
common --disk_cache=./disk_cache
# common --remote_download_minimal
common --remote_download_toplevel
use nix -p bazel_7
/MODULE.bazel.lock
/disk_cache
/bazel-*
/.direnv
/.vscode
load(":defs.bzl", "make_directory", "project", "consumer")

genrule(
    name = "gen",
    srcs = ["defs.bzl"],
    outs = ["defs", "defs2"],
    cmd = "tr -d '\n' < $< > $(location :defs); cp $< $(location :defs2)",
)

genrule(
    name = "number",
    srcs = [".num"],
    outs = ["num"],
    cmd = "cat $< > $@",
)

# NOTE: To test path mapping:
#   - build `gen` for `-c opt` and `-c dbg` (genrule is not path mapped)
#   - build `dir` for `-c opt` and then `-c dbg`
#     * the second build should hit the disk cache
make_directory(
    name = "dir",
    srcs = {
        "//:BUILD.bazel": "out/build",
        "//:MODULE.bazel": "out/mod",
        "//:..bazel-tree-artifact-projection.md": "out/read",
        "//:defs.bzl": "out/gen",
        "//:defs": "gen-squish",
        "//:num": "out/foo/bar/num",
    }
)
################################################################################
# NOTE: this action is not path mapped (and that's okay); see `defs.bzl`.
project(
    name = "num_from_dir",
    dir = ":dir",
    rel_path = "out/foo/bar/num",
    # NOTE: if ^ is a dir, bazel crashes (without our check in `map_each`)...
)
################################################################################
# `consumer` has an artificial delay (5 sec) to make it apparent when it's being
# rebuilt

# NOTE: verify path mapping by building this in multiple configurations
consumer(
    name = "_num_chk",
    input = ":num",
)

# NOTE: to test "narrowing":
#   - build once
#   - change an unrelated input to `dir` (i.e. this file)
#   - build again; should not be rebuilt
#   - change `.num`
#   - build again; *should* be rebuilt
consumer(
    name = "num_via_dir",
    input = ":num_from_dir",
)
# NOTE: even though `project` is not path mapped, `consumer` is.
#
# For `num_via_dir`, Bazel stages `num_from_dir` (in the action's sandbox base)
# as a symlink to `num_from_dir` in the execroot.
#
# This `num_from_dir` (in the execroot) is the symlink that `project` created
# (i.e. points at dir via a relative path). Because this relative path is
# resolved in the context of the execroot (not the sandbox base) it is _fine_
# that it contains a configuration hash.
#
# Technically this is an impurity — if the action were to walk this symlink
# chain, it could see the underlying configuration hash — but in general Bazel
# considers this kind of thing outside of the scope of its hermeticity
# guarantees. Actions that follow symlinks and inspect their paths are able to
# witness other kinds of impurities even without path mapping; i.e. the
# execroot's abs path on disk, the absolute path destination of staged artifacts
# in the sandbox.
# NOTE: to test BWoB w/`--remote_download_minimal` or `--remote_download_toplevel`:
# - change `.num`
# - build `num_from_dir` once (note: disk cache is enabled in `.bazelrc`)
# - do `bazel clean --expunge`
# - build `num_via_dir`
#
# Bazel should recognize that it needs `num_from_dir` to build `num_via_dir` and
# that `num_via_dir` requires (due to the symlink) `dir`.
#
# You can also see this by looking at `bazel-bin/dir`: only `out/foo/bar/num`
# will be present; Bazel will not have fetched all of the artifacts in `dir`;
# only `num` was required.
#
# If you build `dir`, the rest of the artifacts will be fetched from the cache
# as well (note that you will need `--remote_download_toplevel`, not minimal).
# TODO: does this scheme work with RBE?
# - the question really is whether symlinks are staged the same way and
# whether symlinks in action outputs that point to another action's outputs
# are understood (and result in that other output being fetched and such)
#
# I think the answer is yes but: have not tested.
################################################################################
# With `bazel-lib`, cannot do the equivalent; cannot create a symlink into a
# `TreeArtifact` (need to create a copy).
load("@aspect_bazel_lib//lib:directory_path.bzl", "directory_path")
directory_path(
    name = "MODULE_via_dir",
    directory = ":dir",
    path = "out/mod",
)

load("@aspect_bazel_lib//lib:copy_file.bzl", "copy_file")
copy_file(
    name = "_MODULE_copy",
    src = ":MODULE_via_dir",
    out = "MODULE_copy",
    allow_symlink = True,  # causes an analysis failure
)
################################################################################
# Kind of orthogonal; testing the checks done on
# `declare_file`/`declare_directory` outputs and how those checks treat
# symlinks:
#
# (also testing `run_shell` vs `actions.symlink(...)`)
load(":defs.bzl", "sym", "incorrect_dir_output")

# source file symlink (source file -> file)
sym(
    name = "sym1",
    src = ":.num",
)
# source dir symlink (source dir -> file)
#
# NOTE: both `ln -s` and `actions.symlink()` fail post-execution: source is a
# directory, output is a file
#
# NOTE: the source dir is not represented as a `TreeArtifact` in the rules API
sym(
    name = "sym2",
    src = ":foo",  # NOTE: we are warned that this is unsound (source dir input)
)
# generated file symlink (gen file -> file)
sym(
    name = "sym3",
    src = ":num",
)
# TreeArtifact symlink (gen dir -> file)
#
# NOTE: `actions.symlink()` fails in analysis — output is a file but target is
# a directory
#
# NOTE: `ln -s` action fails in post-execution; Bazel *crashes*
# - symlinks into *directories* within a `TreeArtifact` (when a file output
# was declared) are an untested edge case I guess..
# - FIXME? TODO?
#
# NOTE: changing the `ln -s` output to be a directory does make this succeed
# - note that you need to delete the directory first though; Bazel makes it
# for you
sym(
    name = "sym4",
    src = ":num",
)
# NOTE: declared `TreeArtifact` that's actually a symlink to a file is not
# permitted (post execution error)
# NOTE: declared `TreeArtifact` that's actually a file is also not permitted
# NOTE: action producing a directory for a file output results in a post-exec
# error:
incorrect_dir_output(name = "incorrect")
# NOTE: dangling symlinks appear to be disallowed as well (unless you use
# `declare_symlink`)
# (all as you'd hope/expect)
def _make_directory_implementation(ctx):
    out = ctx.actions.declare_directory(ctx.attr.name)

    file_map = {}  # dict[File, str]
    for tgt, rel_dest in ctx.attr.srcs.items():
        files = tgt[DefaultInfo].files.to_list()
        if len(files) != 1:
            fail(
                "All targets in source must have exactly one file", tgt, "has",
                len(files)
            )
        file_map[files[0]] = rel_dest

    cmd = [
        'set -euo pipefail',
        'OUT="${1}"',
        'mkdir -p "$OUT"',
        "shift",
        'while [[ $# -gt 1 ]]; do',
        'src="${1}"; dst="$OUT/${2}"',
        'echo "$src -> $dst"',
        'shift; shift',
        'mkdir -p "$(dirname "$dst")"',
        'cp "$src" "$dst"',
        'done',
        "sleep 0.3",
    ]
    ctx.actions.run_shell(
        outputs = [out],
        inputs = file_map.keys(),
        command = "\n".join(cmd),
        arguments = [(
            ctx.actions.args()
                .add_all([out], expand_directories=False)
                .add_all(
                    file_map.items(),
                    map_each=lambda tup: [tup[0].path, tup[1]],
                    allow_closure=True,
                )
        )],
        execution_requirements = {
            "supports-path-mapping": "1"
        }
    )

    return [DefaultInfo(files=depset([out]))]

make_directory = rule(
    implementation = _make_directory_implementation,
    attrs = {
        "srcs": attr.label_keyed_string_dict(
            allow_empty = False, allow_files = True, mandatory = True
        ),
    },
    executable = False,
    test = False,
)

################################################################################
# TODO: test w/path mapping (is the symlink rewritten?)
def _project_impl(ctx):
    # NOTE: don't want `declare_symlink`; we want Bazel to follow the symlink
    # when considering whether something needs to be rebuilt
    out = ctx.actions.declare_file(ctx.attr.name)

    # NOTE: not allowed (file pointing at a `TreeArtifact`); `out` must be
    # declared as a directory which we don't want
    # ctx.actions.symlink(output=out, target_file=ctx.file.dir)

    # NOTE: not allowed; `out` must be declared as a symlink which we don't
    # want
    # ctx.actions.symlink(output=out, target_path=ctx.file.dir.path)

    # Do the symlinking manually?

    # Without path mapping, this is fine:
    path_map = True
    if not path_map:
        ctx.actions.run_shell(
            outputs = [out],
            inputs = [ctx.file.dir],

            # NOTE: the naïve `ln -sr` approach doesn't work because `-r`
            # appears to canonicalize the symlink target path before
            # relativizing it, resulting in a broken symlink once `out` is
            # moved out of the sandbox area.
            # ---
            # command = 'ln -sr "$@"',
            # arguments = [
            #     ctx.file.dir.path + "/" + ctx.attr.rel_path,
            #     out.path,
            # ],

            # Instead: prepend `../`s to the symlink target path to make it
            # "relative" to the symlink we're creating:
            command = 'ln -s "$@"',
            arguments = [
                # TODO: use skylib for path manip
                (
                    ("../" * out.path.count("/")) +
                    ctx.file.dir.path + "/" + ctx.attr.rel_path
                ),
                out.path,
            ]
        )
    else:
        # NOTE: We can rewrite the above in path mapping style (`Args.add_all`,
        # deferred use of `path`, etc.) but: it will fail if path mapping is
        # actually enabled.
        #
        # We are creating a symlink to a (relative) path that exists during this
        # action's execution but does *not* exist in the real execroot; Bazel
        # will (rightfully) complain that the resulting symlink is dangling when
        # it moves this action's output to the execroot.
        #
        # Note that in some cases we can work around this by further
        # relativizing the symlink target path: if `dir` and `out` have the
        # same root, a relative path that doesn't escape the root will do the
        # job and will resolve both in the path mapped sandbox base and in the
        # actual execroot.
        #
        # But in the general case, no such workaround is possible.
        #
        # NOTE: This is okay though; this action does not really need to be
        # path mapped. It is cheap (just a symlink) and this action *not* being
        # path mapped does not have negative effects on downstream actions that
        # consume the projected artifact (see `BUILD.bazel`).
        rel = ctx.attr.rel_path

        def link_args(tup, dir_expander):
            dir_inp, out_file = tup
            files_in_dir = dir_expander.expand(dir_inp)
            for file in files_in_dir:
                if file.tree_relative_path == rel:
                    # see above:
                    return [
                        # TODO: use skylib for path manip
                        "../" * out.path.count("/") + file.path,
                        out.path
                    ]

            fail(rel, "not found in", dir_inp, "found:", [f.tree_relative_path for f in files_in_dir])

        ctx.actions.run_shell(
            outputs = [out],
            inputs = [ctx.file.dir],
            command = 'ln -sv "$@"',
            arguments = [(
                ctx.actions.args()
                    .add_all([(ctx.file.dir, out)], map_each=link_args, allow_closure=True)
            )],
            execution_requirements = {
                # "supports-path-mapping": "1"  # see comment above
            }
        )

    return [DefaultInfo(files=depset([out]))]

project = rule(
    implementation = _project_impl,
    attrs = {
        "dir": attr.label(
            allow_single_file = True,
        ),
        "rel_path": attr.string(mandatory = True),
    },
    executable = False,
    test = False,
)

################################################################################
def _consumer_implementation(ctx):
    out = ctx.actions.declare_file(ctx.attr.name)
    ctx.actions.run_shell(
        outputs = [out],
        inputs = [ctx.file.input],
        command = 'md5sum $1 > $2 && sleep 5',
        arguments = [(
            ctx.actions.args()
                .add(ctx.file.input)
                .add(out)
        )],
        execution_requirements = {
            "supports-path-mapping": "1"
        }
    )

    return [DefaultInfo(files=depset([out]))]

consumer = rule(
    implementation = _consumer_implementation,
    attrs = {
        "input": attr.label(
            allow_single_file = True,
        ),
    },
    executable = False,
    test = False,
    doc = "dummy rule to test path mapping/make it apparent if cached/rebuilt",
)

################################################################################
# symlink testing
def _sym_implementation(ctx):
    outs = []

    # out1 = ctx.actions.declare_directory(ctx.attr.name + ".1")
    out1 = ctx.actions.declare_file(ctx.attr.name + ".1")
    ctx.actions.run_shell(
        outputs = [out1],
        inputs = [ctx.file.src],
        command = 'rm -r "$2" && touch "$2"',
        # TODO: use skylib for path manip
        arguments = [
            ("../" * out1.path.count("/")) + ctx.file.src.path,
            out1.path,
        ]
    )
    outs.append(out1)

    out2 = ctx.actions.declare_file(ctx.attr.name + ".2")
    ctx.actions.symlink(output=out2, target_file=ctx.file.src)
    outs.append(out2)

    return [DefaultInfo(files=depset(outs))]

sym = rule(
    implementation = _sym_implementation,
    attrs = {
        "src": attr.label(
            allow_single_file = True,
        ),
    },
    executable = False,
    test = False,
)

# dir for file output test:
def _incorrect_dir_output_implementation(ctx):
    out = ctx.actions.declare_file(ctx.attr.name)
    ctx.actions.run_shell(
        outputs = [out],
        inputs = [],
        command = "mkdir $1",
        arguments = [ctx.actions.args().add(out)],
    )

    return [DefaultInfo(files=depset([out]))]

incorrect_dir_output = rule(
    implementation = _incorrect_dir_output_implementation,
    attrs = {
    },
    executable = False,
    test = False,
)
module(name = "tree-artifact-projection-playground")
bazel_dep(name = "aspect_bazel_lib", version = "2.17.1")

rrbutani commented Jun 8, 2025

Also see (previously): "Bazel dynamic input subsetting with TreeArtifacts"

This gist is essentially using the same idea as ^ (narrow a TreeArtifact using symlinks for better incrementality) but with a couple of important differences:

  • in ^, we're going from a known set of files to a TreeArtifact (subset that's not known until execution time)
  • in this gist, we're going from a TreeArtifact to a subset of files (that's ~known [1] during analysis)
  • in ^ we were doing the TreeArtifact business within the confines of one rule
  • in this gist we're explicitly trying to expose the symlinks to other rules such that they can be used without having to be aware of our directory/directory narrowing scheme (i.e. unlike bazel-skylib's DirectoryInfo)

Footnotes

  1. DirectoryExpander escape hatch applies