Last active
January 1, 2025 05:42
Remove multiple text watermarks from a PDF file. Requires xxd and qpdf to work correctly.
#!/bin/bash
# Remove multiple text watermarks from a PDF file. Requires xxd and qpdf to work correctly.
#
# Usage:
#
#   remove-pdf-watermark.sh "Your Input File.pdf" "Your Output File.pdf" [WATERMARK1] [WATERMARK2] [WATERMARK3] [...]
#
# For example:
#
#   remove-pdf-watermark.sh "Your Input File.pdf" "Your Output File.pdf" "Watermark 1" "Watermark 2"
#
# This is a more general (fewer dependencies, more functionality, but slower) version of
# https://gist.github.com/elfsternberg/a96883018d783cbbad7b454ecd0a7ffe
INPUT_FILENAME=$1
OUTPUT_FILENAME=$2
echo "Processing: $INPUT_FILENAME"
# Templates end in XXXXXX so that both GNU and BSD mktemp accept them.
UNCOMPRESSED=$(mktemp "${TMPDIR:-/tmp}/uncompressed.XXXXXX")
UNCOMPRESSED_HEX=$(mktemp "${TMPDIR:-/tmp}/uncompressed-hex.XXXXXX")
UNMARKED_PRE=$(mktemp "${TMPDIR:-/tmp}/unmarked-pre.XXXXXX")
# Decompress all streams so the watermark text becomes searchable.
qpdf --stream-data=uncompress --decode-level=all "$INPUT_FILENAME" "$UNCOMPRESSED"
echo " - Decompressing to: $UNCOMPRESSED"
# Dump the whole file as hex; with -ps, -c 0 produces one long line.
xxd -ps -c 0 "$UNCOMPRESSED" > "$UNCOMPRESSED_HEX"
echo " - Hex dumping to: $UNCOMPRESSED_HEX"
rm "$UNCOMPRESSED"
# Arguments 3 onwards are the watermark strings.
for WATERMARK in "${@:3}"; do
    echo " - Processing Watermark: \"$WATERMARK\""
    # Hex-encode the watermark and an equally long run of spaces.
    WATERMARKLEN=${#WATERMARK}
    WATERMARK_HEX=$(printf '%s' "$WATERMARK" | xxd -p -c 0)
    BLANKS=$(printf "%${WATERMARKLEN}s" "")
    BLANKS_HEX=$(printf '%s' "$BLANKS" | xxd -p -c 0)
    NUM_OCCURRENCES_=$(grep -o "$WATERMARK_HEX" "$UNCOMPRESSED_HEX" | wc -l)
    NUM_OCCURRENCES=$((0+$NUM_OCCURRENCES_))
    echo "   * Number of occurrences: $NUM_OCCURRENCES"
    if [[ $NUM_OCCURRENCES -gt 0 ]]
    then
        echo "   * Replacing with $WATERMARKLEN blanks..."
        # Write to a new file and move it back: "sed -i" differs between GNU and BSD sed.
        sed "s/$WATERMARK_HEX/$BLANKS_HEX/g" "$UNCOMPRESSED_HEX" > "$UNCOMPRESSED_HEX.new" &&
            mv "$UNCOMPRESSED_HEX.new" "$UNCOMPRESSED_HEX"
        echo "   * Replacement done."
        NUM_OCCURRENCES_=$(grep -o "$WATERMARK_HEX" "$UNCOMPRESSED_HEX" | wc -l)
        NUM_OCCURRENCES=$((0+$NUM_OCCURRENCES_))
        echo "   * Final number of occurrences: $NUM_OCCURRENCES"
    fi
done
# Convert the edited hex dump back to binary, then let qpdf recompress it.
xxd -r -p -c 0 "$UNCOMPRESSED_HEX" "$UNMARKED_PRE"
rm "$UNCOMPRESSED_HEX"
echo " - Reverting from hex dump to: $UNMARKED_PRE"
qpdf --stream-data=compress "$UNMARKED_PRE" "$OUTPUT_FILENAME"
rm "$UNMARKED_PRE"
echo " - Output written to: $OUTPUT_FILENAME"
echo
echo
# NO WARRANTY
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
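The core of the script is the replacement step: each watermark is hex-encoded and swapped for the same number of space bytes. A minimal sketch of that step in isolation follows. The helper names `to_hex` and `blank_hex` are hypothetical and not part of the gist, and `od` is used here instead of `xxd` only to keep the sketch dependency-free. It derives the blanks from the byte count of the hex string rather than from the character count, which would matter for a non-ASCII watermark (an assumption about the input, not something the gist addresses).

```shell
#!/bin/bash
# Sketch only: to_hex and blank_hex are hypothetical helper names.

# Print the bytes of $1 as one lowercase hex string, e.g. "AB" -> "4142".
to_hex() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# Print one "20" (an ASCII space) for every byte in the hex string $1,
# so the replacement is exactly as long as the original.
blank_hex() {
    printf '%s' "$1" | sed 's/../20/g'
}

# Possible use inside the watermark loop:
#   WATERMARK_HEX=$(to_hex "$WATERMARK")
#   BLANKS_HEX=$(blank_hex "$WATERMARK_HEX")
```

Because the search runs over a plain hex string, a pattern could in principle also match at an odd nibble offset; the sketch does not guard against that.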
I think that sounds really cool. Having structured data to work with allows so much more. Also, thanks for qpdf. It has saved me so much time.
Here's one JSON example, which works for some PDFs:
qpdf --json --json-stream-data=file in.pdf out.json
for i in `grep -Hi confidential out.json-* | cut -d':' -f1`; do
cp /dev/null $i;
done
qpdf --json-input out.json out.pdf
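The loop above could also be wrapped in a small function that copes with spaces in file names and takes the search text as a parameter. This is only a sketch: `blank_matching_streams` is a hypothetical name, and the `out.json-` prefix is taken from the example above rather than from any qpdf guarantee.

```shell
#!/bin/bash
# Sketch only: blank_matching_streams is a hypothetical helper, not part of qpdf.
# It truncates every regular file whose name starts with PREFIX and whose
# contents contain PATTERN (fixed string, case-insensitive).
blank_matching_streams() {
    local prefix=$1 pattern=$2 f
    for f in "$prefix"*; do
        [ -f "$f" ] || continue
        if grep -qiF -- "$pattern" "$f"; then
            : > "$f"    # same effect as: cp /dev/null "$f"
        fi
    done
}

# Intended use, assuming a qpdf version with JSON v2 support:
#   qpdf --json --json-stream-data=file in.pdf out.json
#   blank_matching_streams out.json- confidential
#   qpdf --json-input out.json out.pdf
```

As with the original loop, this empties the whole stream file, so it only suits PDFs where the watermark sits in a stream of its own.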
@dhewg I think this is the first example I've seen of user-contributed qpdf json. :-)
And such a small yet powerful one! ;) Some watermarks are just too much in-your-face...
Thanks for the json format, this makes it easy to manipulate with standard tools!
Fails with:
mktemp: too few X's in template ‘uncompressed’
mktemp: too few X's in template ‘uncompressed-hex’
mktemp: too few X's in template ‘unmarked-pre’
qpdf: an output file name is required; use - for standard output
For help:
qpdf --help=usage usage information
qpdf --help=topic help on a topic
qpdf --help=--option help on an option
qpdf --help general help and a topic list
- Decompressing to:
./pdf_remove_watermark.sh: line 29: $UNCOMPRESSED_HEX: ambiguous redirect
- Hex dumping to:
rm: missing operand
Try 'rm --help' for more information.
Running on:
qpdf v11.9.1-1
coreutils v9.4-3.1
Me too. I changed the mktemp's to this:
UNCOMPRESSED=`mktemp 'uncompressed-XXXXXXXXXX.pdf'`
UNCOMPRESSED_HEX=`mktemp 'uncompressed-hex-XXXXXXXXXX.pdf'`
UNMARKED_PRE=`mktemp 'unmarked-pre-XXXXXXXXXX.pdf'`
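The templates above create the temporary files in the current directory. A variant that keeps them in the system temporary directory, and that should be accepted by both GNU and BSD `mktemp`, might look like the sketch below. `make_tmp` is a hypothetical helper name, and falling back to `/tmp` when `TMPDIR` is unset is an assumption.

```shell
#!/bin/bash
# Sketch only: make_tmp is a hypothetical helper name.
# Create a temporary file named NAME.XXXXXX in $TMPDIR (or /tmp if unset)
# and print its path. A full template with trailing X's is accepted by
# both GNU and BSD mktemp.
make_tmp() {
    mktemp "${TMPDIR:-/tmp}/$1.XXXXXX"
}

# Possible use in the script:
#   UNCOMPRESSED=$(make_tmp uncompressed)
#   UNCOMPRESSED_HEX=$(make_tmp uncompressed-hex)
#   UNMARKED_PRE=$(make_tmp unmarked-pre)
```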
It will be interesting to see whether/how qpdf json v2 might help you with this kind of task once qpdf 11 is out. qpdf json v2 is in main now, though the interface isn't frozen until I release. If you want to take a look, you can check out current main. Documentation is at https://qpdf.readthedocs.io/en/latest/json.html