Created
November 25, 2023 21:44
A bash Zappifier. (Not the latest version, due to... computer ownership shenanigans.)
#!/bin/bash
set -euo pipefail
#set -x

## HOW TO HOLD IT:
##
## Give the program you want to zapp as the first argument.
##
## For the couple of different usage patterns:
##   - If you want to use directory shard conventions, set SPLAY_BASE to something sensible.
##   - If you want to use file shard conventions, set SPLAY_BASE="-".
##   - If you want just the whole files in output, do neither of the above -- it's the default.
##
## If you want to bundle multiple programs into the same bin dir,
## just set OUT_DIR to the same thing and run the whole script repeatedly.
##
## If you want to set OUT_DIR *and* SPLAY_BASE
##
## Most of these other variables you only set if you have to.
## (We'll try to find an "ldshim" on your path -- or you can just tell us where it is, via SHIM_BIN.
##  We'll assume you have the usual host ELF interp, unless ZAPPIT_ELFINTERP says otherwise.
##  Stuff like that.)

target_program="${1?"must provide target program as first argument"}"
ZAPPIT_ELFINTERP="${ZAPPIT_ELFINTERP:-"/usr/lib64/ld-linux-x86-64.so.2"}"
OUT_DIR="${OUT_DIR:-"/tmp/zappme/app/$(basename "$target_program")"}"
SHIM_BIN="${SHIM_BIN:-"$(which ldshim)"}"
SPLAY_BASE=${SPLAY_BASE:-""} ## Can be a pattern.

# Spicy: the elf interp segfaults on a static binary. Maybe that's bad. Geesh.
readarray -t liblines < <(LD_TRACE_LOADED_OBJECTS=1 "$ZAPPIT_ELFINTERP" "$target_program")

declare -A libs
for line in "${liblines[@]}"; do
    #echo
    #echo "${line@Q}"

    ## This regexp hits on a couple points:
    ##   - The output lines always start with a tab.
    ##   - The library name is whatever comes before a "=>".
    ##     - ... except when it's an in-memory only library; then the "=>" doesn't appear at all.
    ##       For our purposes: we just don't match that line. We don't need it anyway.
    ##   - There's a file path after that. It's always absolute.
    ##   - A memory address comes in parens at the end.
    ##     We don't need this, so we don't report it, but we do match on it, just to be exhaustive.
    ##
    ## Is that enough?
    ##
    ## Well, I don't know. It depends on how loosey-goosey your ELF interp is
    ## about handling any library names or filesystem paths that have wonky characters in them.
    ## (It looks to me like it's typical for there to be no escaping at all on the file path part,
    ## which is why our regexp is so complete about handling the full line.)
    ## (If the path to the library contains a linebreak, things are truly impossible to control.
    ## This is unfortunate, because it's possible. I see no way to address this in our script;
    ## the ELF interp would have to do some escaping or validation of its output, and it... doesn't.)
    ## (I haven't tested at all what happens if a library name has madness in it.
    ## I suspect it's unhandled in the typical ELF interp as well, and thus uncorrectable here.)
    ##
    ## I've seen other parsers of this data outright assume no spaces are present in any names,
    ## and we've managed to do a bit better than that. But ultimately we're parsing a format
    ## that has no escaping that is sufficient for the range of data it's willing to pass through,
    ## and there's simply no way to secure that. The regexp isn't the problem; the data is.
    ##
    ## So! Moving along... as best we can...
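    ## (No NUL exclusions in the bracket expressions: a bash string can't hold a NUL byte anyway,
    ##  and POSIX ERE bracket expressions have no `\0` escape to express one.)
    if [[ $line =~ ^$'\t'([^/]+)\ =\>\ (/.+)\ \([x0-9a-f]+\)$ ]]; then
        #declare -p BASH_REMATCH
        libs["${BASH_REMATCH[1]}"]="${BASH_REMATCH[2]}"
    fi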
    ## One more interesting bit of that regexp: we don't match if any slashes are in the library name.
    ## As far as I know, the only time a slash appears in the library name is itself a special case:
    ## it's when the ELF interp reports *itself*: it does this with a full path, including a leading slash.
    ## Excluding that from our consideration is generally correct for our purposes here.
    ## TODO: if something uses the mad/unsafe ORIGIN format for the ELF interp itself, I wonder if that shows up here instead.
done
echo ---
declare -p libs
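
## A minimal sketch for poking at the trace line format by hand. It is only defined here,
## never called, so it changes nothing about the flow above.
## `parse_trace_line` is a hypothetical helper name, and the sample line below is made up.
## It matches the same line shape described in the loop comments (glibc-style trace output assumed)
## and prints "name<TAB>path" on a match; on anything else it returns 1.
parse_trace_line() {
    local line="$1"
    if [[ $line =~ ^$'\t'([^/]+)\ =\>\ (/.+)\ \([x0-9a-f]+\)$ ]]; then
        printf '%s\t%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}
## Example:
##   parse_trace_line $'\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)'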

## N.B., Right now, this tool is only focused on dynamic libraries
## (e.g. `.so` files). We detected those above by having the ELF interp
## report what it would load for the target program.
##
## In the future, there's no reason the attention targets can't grow:
## both explicit inputs and files detected by strace could be attended to;
## and the output dirs of those might be "data" instead of "lib".

## There are two angles of approach to getting content addressed paths.
##
##   - We can say that packaging controls this, and things "should" already be
##     in directories, where the path name contains the hashes;
##   - Or, we can say "screw it" and apply a hash ourselves and declare that's
##     how it's going to be now.
##
## If we're building tools that are meant to live in a hostile world, and bring
## things into the fold the instant we encounter them, wherever they're coming from:
## then the second approach is the more powerful, because it demands nothing in advance,
## and also leaves nothing to be decided (the hashes are file granularity: done).
## The downside is: the only well-defined choice is hashes of file granularity;
## and that means we've pretty much removed any way to preserve any concepts of
## organizational intention. (Whether that matters for shared libs is debatable!)
##
## The approach of asking for directories to already be content-addressed is...
## well, it's asking for more. However, it also provides more.
## Specifically, if there are multiple library files from the same package,
## they get to stay as siblings on the filesystem (with each other, and perhaps,
## although it is rare for this to matter, also staying adjacent to other data files).
## Also, if you have a system managing packages... it gets to manage packages,
## as opposed to managing individual files -- and the former is a bit smaller of a number.
##
## Hybridizations are possible. You can *always* construct the file-scale CA index
## for a heap of libraries you already have available.
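
## A minimal sketch of the "apply a hash ourselves" option, at file granularity.
## Assumptions (none of this is decided anywhere in this script): sha256 via coreutils
## `sha256sum`, and a two-character fan-out directory in front of the rest of the digest.
## `ca_path_for_file` is a hypothetical helper name. It is only defined here, never called.
ca_path_for_file() {
    local file="$1" hash
    ## Hash from stdin, so the output line is always "<digest>  -", whatever the filename is.
    hash="$(sha256sum < "$file")"
    hash="${hash%% *}"
    printf '%s/%s/%s\n' "${hash:0:2}" "${hash:2}" "$(basename -- "$file")"
}
## Example: `ca_path_for_file /lib64/libc.so.6` prints something shaped like
## "<2 hex chars>/<62 hex chars>/libc.so.6", which could be joined onto "$OUT_DIR/lib".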

## For auto-detecting content-address-friendly paths, we do the following:
##
## Step 1: resolve (all) symlinks and clean the path.
##
## Step 2: look for our configured sharding path hunk.
##   This can be at any depth in the path.
##   It's a very simple match. It's just the string literal (there's no
##   imaginable call for globbing or patterns here), anywhere within the path
##   (as long as it's not the last two segments -- the last is the filename,
##   and the second-to-last should be the content-addressing mangle-named dir.)
##   (We can't really do any more checks than that: We don't assume
##   anything about the format of the content-addressing mangle that's
##   presumed to be the subsequent path segment, so we've got nothing to check
##   there; and the directory depth inside that dir can be arbitrarily high.)
##
## Step 3: for any path where we did find shard-looking paths,
##   TODO FINISH
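
## A minimal sketch of Steps 1 and 2 above (Step 3 is left alone), under assumptions:
## `realpath` from GNU coreutils does the symlink resolution and path cleaning, and the
## shard hunk is matched as a literal run of whole path segments.
## `find_shard_suffix` is a hypothetical helper name. It is only defined here, never called.
## Usage: find_shard_suffix <path> <hunk>
## On a match it prints the path from the content-addressing dir onward; otherwise it returns 1.
find_shard_suffix() {
    local path hunk="$2" rest
    path="$(realpath -- "$1")" || return 1
    case "$path" in
        */"$hunk"/*) rest=${path#*/"$hunk"/} ;;  ## quoted, so the hunk is a literal, not a glob
        *) return 1 ;;
    esac
    ## The hunk must not sit in the last two segments:
    ## at least "<mangle-dir>/<filename>" has to be left over after it.
    case "$rest" in
        */*) printf '%s\n' "$rest" ;;
        *) return 1 ;;
    esac
}
## Only the first occurrence of the hunk in the path is considered in this sketch.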

## Set up the directory skeleton for our packaged output.
mkdir -p -- "$OUT_DIR"
mkdir -p -- "$OUT_DIR/bin"
mkdir -p -- "$OUT_DIR/dynbin"
mkdir -p -- "$OUT_DIR/lib"

## In all cases, we copy the binary to the dynbin folder, and the shim to the bin folder:
cp -- "$SHIM_BIN" "$OUT_DIR/bin/$(basename "$target_program")"
cp -- "$target_program" "$OUT_DIR/dynbin/$(basename "$target_program")"

case "$SPLAY_BASE" in
    "")
        >&2 printf "copying library files...\n"
        for lib in "${!libs[@]}"; do
            ## For the simplest story, where we're just copying files entirely:
            cp -- "${libs["$lib"]}" "$OUT_DIR/lib/$lib"
        done
    ;;
    "-")
        >&2 printf "creating file-hash splay of library files...\n"
        ## TODO
    ;;
    *)
        >&2 printf "searching for splay patterns in paths to library files...\n"
        ## TODO
    ;;
esac

echo "----"
find "$OUT_DIR"
echo "----"
echo "size by parts:"
du -sh "$OUT_DIR"/*
echo "----"
echo "size in total:"
du -sh "$OUT_DIR"