This requires `recode` to be installed (`brew install recode` or `apt install recode`).
The example shows HTML files only. Adjust as required.
- (Recursively) find all files that end with `*.htm` or `*.html`.
- For each match, check its file type using `file`.
- Keep only the replies that contain the `ISO-8859` tag and extract the filename from them, using `grep` and `cut`. (Note: if you're using a more sophisticated version of `grep`, such as ugrep, you might be able to format the result directly and skip the pipe to `cut`, showing only selected fields; "modern" `grep` versions may also have some formatting options these days, but I have not checked.)
- This will give you a list of the paths (starting at the current working directory) of all files currently detected as Latin-1.
- Pipe the result to `cat` (why exactly this was needed is a bit beyond me, since `cat` simply copies its input through unchanged; it seemed to be some sort of shell-y requirement, which baffled me for quite a while).
- Feed the generated list through the usual `while read line; do ...; done` shell loop, which picks up each filename in turn.
- Feed each filename to `recode` for converting it from Latin-1 to UTF-8, while preserving all timestamps and other attributes.
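The steps above can be sketched as two small shell functions, which also makes it possible to preview the list before anything is converted. This is only a sketch: the function names `list_latin1_html` and `recode_listed` are made up for illustration, it assumes that `file` describes Latin-1 files with the `ISO-8859` tag (as the steps above rely on), and it leaves out the `cat` stage on the assumption that it is a plain pass-through.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical, not part of any tool.

# Steps 1-4: print the path of every *.htm / *.html file below "$1"
# (default: current directory) that `file` describes as ISO-8859 text.
list_latin1_html() {
    find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) \
        -exec sh -c 'file "$1" | grep ISO-8859 | cut -d: -f1' sh {} \;
}

# Steps 6-7: read one path per line from stdin and recode each file in place.
# IFS= and -r keep leading blanks and backslashes in filenames intact.
recode_listed() {
    while IFS= read -r path; do
        recode Latin-1..UTF-8 "$path"
    done
}

# Usage (commented out so that nothing is converted by accident):
#   list_latin1_html .                   # dry run: only show the candidates
#   list_latin1_html . | recode_listed   # convert them in place
```

Since `recode` rewrites the files it is given, running the listing function on its own first is a cheap way to check what would be touched. Filenames containing a colon or a newline would still confuse the `cut` and `read` steps.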
Is this the best solution? Probably not. It has the advantage of needing just two linear passes (roughly 2N steps, so still O(N), for N = number of files in the directory tree): `find` does a single pass to extract all filenames; these are then fed (as if they were just one list) into a loop that does the conversion, one by one. By that stage they have already been filtered (i.e. no binaries, only text files that `file` reports as ISO-8859, etc.).
find . -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c 'file "$1" | grep ISO-8859 | cut -d: -f1' sh {} \; | cat | while IFS= read -r line; do recode Latin-1..UTF-8 "$line"; done
It's possible to do everything in a single pass (i.e. about N steps instead of 2N), but the exact command eluded me.
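One way a single pass might look is to let `find` hand the filenames to a small inline script that both checks and converts each file. This is a sketch under assumptions rather than a tested recipe for every system: `convert_latin1_html` is a hypothetical name, and it again relies on `file` tagging Latin-1 files as `ISO-8859`.

```shell
#!/bin/sh
# Sketch only: convert_latin1_html is a hypothetical helper name.

# Walk the tree below "$1" (default: current directory) once; each HTML
# file is checked and, if flagged as ISO-8859, recoded in the same iteration.
convert_latin1_html() {
    find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) -exec sh -c '
        for path in "$@"; do
            # file -b prints only the description, without the filename prefix
            if file -b "$path" | grep -q "ISO-8859"; then
                recode Latin-1..UTF-8 "$path"
            fi
        done
    ' sh {} +
}

# Usage (commented out so that nothing is converted by accident):
#   convert_latin1_html .
```

The `{} +` form passes many filenames to each `sh` invocation, so far fewer shells are started than with `\;`, and because the filename never travels through a text pipe, colons or spaces in names are not a problem here.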
You can also try a different approach: use `find` just to retrieve directory names and give you a tree of those. Then feed those to `grep`, which will evaluate all the entries in each directory. The theory here is that `grep` (and especially `ugrep`!) might be considerably faster than `find` on each directory. And it's even possible that a few tweaks might allow `ugrep` (which works recursively by default!) to do all the work, and pipe the results to the `while` loop. Or even execute `recode` directly. Hmm. I should look more into that possibility...