#!/bin/bash
# Convert a DjVu file to PDF via per-page hOCR text and TIFF images.
# Method found here: https://askubuntu.com/a/122604/423332
# Dependencies:
#   On Ubuntu, you can install ocrodjvu and pdfbeads with:
#     sudo apt install ocrodjvu
#     gem install pdfbeads
# The path and filename given can only contain ASCII characters.

# Require the input file as the first argument
if [ -z "$1" ]; then
    echo "Usage: $0 file.djvu" >&2
    exit 1
fi
f=$1

# Get filename
filename=$(basename -- "$f")
extension="${filename##*.}"   # not used below
file_no_ext="${filename%.*}"

# Count number of pages
echo "f=$f"
p=$(djvused -e n "$f")
echo -e "The document contains $p pages.\n"

# Number of digits, used to zero-pad page numbers so files sort in page order
pp=${#p}

echo "###############################"
echo "### Extracting page by page ###"
echo "###############################"
# For each page, extract the text and the image
for i in $(seq 1 "$p")
do
    ii=$(printf "%0${pp}d" "$i")
    djvu2hocr -p "$i" "$f" | sed 's/ocrx/ocr/g' > "pg$ii.html"
    ddjvu -format=tiff -page="$i" "$f" "pg$ii.tiff"
done

echo ""
echo "##############################"
echo "### Building the final pdf ###"
echo "##############################"
# Build the final pdf from the pg* files in the current directory
pdfbeads > "$file_no_ext.pdf"
echo ""
echo "Done"

# Remove temp files
echo ""
read -p "Do you want to delete temp files? (pg*.html, pg*.tiff, pg*.bg.jpg) " -n 1 -r
echo # (optional) move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    # -f avoids an error message if one of the patterns matches nothing
    rm -f pg*.html pg*.tiff pg*.bg.jpg
fi
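
Because the script calls several external tools (djvused, djvu2hocr, ddjvu and pdfbeads), a check before the page loop may give a clearer error than a failure partway through. The following is only a minimal sketch and not part of the original gist: the function name missing_tools is hypothetical, and it assumes the tools are installed under exactly the names the script uses.

```shell
#!/bin/bash
# Hypothetical helper (not from the original gist): print every command
# name given as an argument that cannot be found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names are taken from the commands the conversion script calls.
missing=$(missing_tools djvused djvu2hocr ddjvu pdfbeads)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on a single line.
    echo "Missing tools:" $missing >&2
    # In the conversion script you would probably stop here (e.g. exit 1).
fi
```

If something like this were placed near the top of the script, a missing gem or package would be reported before any pg* files are written.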
I packaged pdfbeads (with patches) to work on Debian without warnings (including the RMagick vs. rmagick thing) and with all the dependencies set to be pulled in. It should work on a sufficiently new Ubuntu version (I don't know how new, since I don't follow Ubuntu releases that closely). That being said, if I introduce that package into Debian, then getting it to work on Ubuntu should be relatively simple.
I can try to provide a precompiled version of it on a PPA that I have (where I keep other tools that I find useful).
In the meantime, the unfinished (but working) package is at: https://github.com/rbrito/pkg-pdfbeads
It works very well for me and I will try this script to see how well things go when we mix everything together.
Used it just now and everything worked perfectly. The only quirk was that I had to roll back RubyGems with gem update --system 3.0.8 to get rmagick to install properly and stop complaining that constant Gem::ConfigMap is deprecated (issue and fix discussed here).
> gem list rmagick iconv pdfbeads
*** LOCAL GEMS ***
rmagick (4.2.5, 2.16.0)
iconv (1.0.8)
pdfbeads (1.1.3)
Thanks so much for this helpful script!!
I get :