#!/bin/bash
# Method found here https://askubuntu.com/a/122604/423332
# Dependencies:
# On Ubuntu, you can install ocrodjvu and pdfbeads with:
#   sudo apt install ocrodjvu
#   gem install pdfbeads
# The path and filename given can only contain ASCII characters
f=$1
# Get filename
filename=$(basename -- "$f")
extension="${filename##*.}"
file_no_ext="${filename%.*}"
# Count number of pages
echo "f=$f"
p=$(djvused -e n "$f")
echo -e "The document contains $p pages.\n"
# Number of digits (used to zero-pad page numbers so files sort correctly)
pp=${#p}
echo "###############################"
echo "### Extracting page by page ###"
echo "###############################"
# For each page, extract the text (as hOCR) and the image (as TIFF)
for i in $(seq 1 "$p")
do
    ii=$(printf "%0${pp}d" "$i")
    djvu2hocr -p "$i" "$f" | sed 's/ocrx/ocr/g' > "pg$ii.html"
    ddjvu -format=tiff -page="$i" "$f" "pg$ii.tiff"
done
echo ""
echo "##############################"
echo "### Building the final pdf ###"
echo "##############################"
# Build the final pdf from the pg*.tiff and pg*.html files in the current directory
pdfbeads > "$file_no_ext.pdf"
echo ""
echo "Done"
# Remove temp files
echo ""
read -p "Do you want to delete temp files? (pg*.html, pg*.tiff, pg*.bg.jpg) " -n 1 -r
echo # (optional) move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    # -f: do not complain if one of the patterns matches nothing
    rm -f pg*.html pg*.tiff pg*.bg.jpg
fi
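The script above assumes that an input file was given and that djvused, djvu2hocr, ddjvu and pdfbeads are all installed. A minimal sketch of a check that could run before the extraction loop is shown here; the function names missing_tools and preflight are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print the name of every
# command given as an argument that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Hypothetical wrapper: return non-zero if the input file is missing or if
# any of the tools the script calls is not installed.
preflight() {
    local input=$1 missing
    if [ -z "$input" ] || [ ! -f "$input" ]; then
        echo "Usage: $0 file.djvu" >&2
        return 1
    fi
    missing=$(missing_tools djvused djvu2hocr ddjvu pdfbeads)
    if [ -n "$missing" ]; then
        # Unquoted on purpose so the names end up on one line.
        echo "Missing commands:" $missing >&2
        return 1
    fi
}
```

In the script this could be called as preflight "$f" || exit 1 right after f=$1, so that a missing dependency is reported before any page is extracted.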
FYI, the script worked almost perfectly[1] on my Ubuntu 18.04 box with the following package installation commands.
$ sudo apt-get install ruby-dev ruby-rmagick ocrodjvu
$ sudo gem install pdfbeads iconv
$ gem list rmagick iconv pdfbeads
rmagick (2.16.0)
iconv (1.0.8)
pdfbeads (1.1.1)
[1] The output pdf contains "W: `require 'RMagick'` is deprecated, please change to `require 'rmagick'`" at the beginning because pdfbeads contains require 'RMagick'. Running tail -n +2 output.pdf > fixed.pdf is necessary to delete the line.
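The tail -n +2 workaround removes the first line unconditionally, so it would damage a pdf that has no warning in it. A possible guarded variant, as a sketch only: it assumes the warning is a single line starting with "W: " as quoted above, and the function name strip_leading_warning is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: copy a pdfbeads output file to a new file, dropping
# the first line only when the file begins with a "W: " warning.
strip_leading_warning() {
    local src=$1 dst=$2
    if [ "$(head -c 3 < "$src")" = "W: " ]; then
        # Warning present: keep everything from the second line onwards.
        tail -n +2 "$src" > "$dst"
    else
        # No warning: keep the file unchanged.
        cp -- "$src" "$dst"
    fi
}
```

Usage would be strip_leading_warning output.pdf fixed.pdf, with the same file names as in the footnote.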
I get:
$ gem list rmagick
*** LOCAL GEMS ***
rmagick (4.0.0)
I packaged pdfbeads (with patches) to work on Debian without warnings (including the RMagick vs. rmagick thing) and with all the dependencies set to be pulled in. It should work on a sufficiently new Ubuntu version (I don't know how new, since I don't follow Ubuntu releases that closely). That being said, if I introduce that package into Debian, then getting it to work on Ubuntu should be relatively simple.
I can TRY TO provide a precompiled version of it on a PPA that I have (where I have other tools that I find useful).
In the meantime, the unfinished (but working) package is at: https://github.com/rbrito/pkg-pdfbeads
It works very well for me and I will try this script to see how well things go when we mix everything together.
Used it just now and everything worked perfectly. The only quirk was that I had to roll back with gem update --system 3.0.8 to get rmagick to install properly and stop complaining that constant Gem::ConfigMap is deprecated (issue and fix discussed here).
> gem list rmagick iconv pdfbeads
*** LOCAL GEMS ***
rmagick (4.2.5, 2.16.0)
iconv (1.0.8)
pdfbeads (1.1.3)
Thanks so much for this helpful script!!
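For anyone wanting to check whether the rollback described above applies to their machine, here is a small sketch that compares the installed RubyGems version with 3.0.8. It assumes GNU sort (for the -V version sort), and the function names version_gt and rubygems_newer_than are hypothetical. Whether a newer RubyGems actually triggers the Gem::ConfigMap complaint on a given system is not something this check can tell.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is strictly newer than $2.
# Relies on `sort -V` (version sort from GNU coreutils).
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical wrapper: succeed if the installed RubyGems is newer than the
# version given as $1. Returns 2 if the gem command cannot be run.
rubygems_newer_than() {
    local wanted=$1 current
    current=$(gem --version) || return 2
    version_gt "$current" "$wanted"
}
```

If rubygems_newer_than 3.0.8 succeeds and rmagick refuses to install, running gem update --system 3.0.8 as in the comment above is the rollback that was reported to work.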
At this point it would be quite a task to consistently roll back all the software updates I listed to recreate the original environment I started with. My environment was quite similar to what you have listed above. However, it was missing rmagick and, if I remember correctly, that was what opened the Pandora's box. The only option available with apt was to add the current version, RMagick 4.0.0, which required a newer version of ruby and gcc, and that led me down the garden path I described above. Had I known that would be the case, I would have made an effort to find a way to install an older version. In any case, I'm surprised that the installation of this Ruby gem is so sensitive to software updates that came after the versions that existed when it was released.

I wonder what version of rmagick you get when you run gem list rmagick?

Thanks for the suggestion to reach out to @zetah. However, that does not look promising: according to his profile, he was last active in April 2017, and his last contributions date from 2014.