@emres
Created December 18, 2011 21:58
Language detection in Bash command line
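# Assumes EN and DE are plain-text reference files for English and German, and some_en.txt is the text to classify.
# For each language: gzip size of (text + reference) minus gzip size of the reference alone; the smallest increase wins.
en_res=$(echo $(cat some_en.txt EN | gzip | wc -c) - $(gzip -c EN | wc -c) | bc); \
de_res=$(echo $(cat some_en.txt DE | gzip | wc -c) - $(gzip -c DE | wc -c) | bc); \
(echo $en_res EN && echo $de_res DE) | sort -n | head -1 | cut -d' ' -f2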
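
The one-liner above can also be wrapped in a small function that accepts any number of reference files. This is only a sketch of the same idea, not the author's code: the names `gz_size` and `detect_lang` are made up for illustration, and it uses shell arithmetic in place of `bc` for the subtraction.

```shell
# Hypothetical helper: compressed size in bytes of the concatenation of the given files.
gz_size() {
    cat "$@" | gzip -c | wc -c
}

# Hypothetical helper: detect_lang TEXT REFERENCE...
# Prints the reference file whose compressed size grows least when TEXT is prepended to it.
detect_lang() {
    text=$1
    shift
    for corpus in "$@"; do
        # Extra compressed bytes that TEXT adds on top of this reference file.
        delta=$(( $(gz_size "$text" "$corpus") - $(gz_size "$corpus") ))
        echo "$delta $corpus"
    done | sort -n | head -n 1 | cut -d' ' -f2-
}
```

With the files from the one-liner it would be called as `detect_lang some_en.txt EN DE`. How reliable the guess is depends on how large and representative the reference files are.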
@shelomito12

I get the following error in Ubuntu 20.04:

(standard_in) 1: illegal character: \342
(standard_in) 1: illegal character: \200
(standard_in) 1: illegal character: \223
(standard_in) 1: syntax error
(standard_in) 1: illegal character: \342
(standard_in) 1: illegal character: \200
(standard_in) 1: illegal character: \223
(standard_in) 1: syntax error
EN

Is there another easy Linux command to detect language output?
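
A note on the messages above: the bytes `\342 \200 \223` are the octal form of the UTF-8 encoding of an en dash (U+2013). One plausible cause, which is an assumption rather than something confirmed in this thread, is that an ASCII hyphen in the command was turned into an en dash when it was copied from a rendered page. A minimal sketch of a filter that turns en dashes back into hyphens (the name `fix_dashes` is hypothetical):

```shell
# Hypothetical helper: replace UTF-8 en dashes (octal 342 200 223) on stdin with ASCII hyphens.
fix_dashes() {
    en_dash=$(printf '\342\200\223')
    # LC_ALL=C makes sed treat its input as plain bytes.
    LC_ALL=C sed "s/${en_dash}/-/g"
}
```

Piping an expression through `fix_dashes` before it reaches `bc` should avoid the "illegal character" messages if the en dash is indeed the cause.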

@emres

emres commented Feb 7, 2022

Hello, you might want to try Apache Tika or polyglot.
