@rasbt
Last active August 29, 2015 14:07
English language detection
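import nltk

# The word list must be available locally; run nltk.download('words') once.


def eng_ratio(text):
    ''' Returns the fraction of unique alphabetic words in a text that are not in the English vocabulary '''
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    text_vocab = set(w.lower() for w in text.split() if w.lower().isalpha())
    if not text_vocab:
        # No alphabetic words at all: avoid dividing by zero
        return 0.0
    unusual = text_vocab.difference(english_vocab)
    # float() keeps the division exact under Python 2 as well
    diff = len(unusual) / float(len(text_vocab))
    return diff


text = 'This is a test fahrrad'
print(eng_ratio(text))
# prints 0.2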
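The ratio on its own does not decide whether a text is English. The following is a minimal sketch of one way to turn it into a yes/no check, not part of the original gist. The names `non_english_ratio` and `is_english`, the `vocab` parameter, and the `threshold` default of 0.5 are all assumptions made for illustration. Passing the vocabulary in as an argument lets it be built once (for example from `nltk.corpus.words.words()`) and reused, and stripping punctuation keeps tokens like `test,` from being discarded.

```python
import string


def non_english_ratio(text, vocab):
    """Fraction of unique alphabetic words in `text` that are missing from `vocab`."""
    # Strip surrounding punctuation so that 'test,' still counts as 'test'.
    words = set(w.strip(string.punctuation).lower() for w in text.split())
    words = set(w for w in words if w.isalpha())
    if not words:
        return 0.0
    return len(words - vocab) / float(len(words))


def is_english(text, vocab, threshold=0.5):
    """Hypothetical helper: call a text English if fewer than `threshold` of its words are unknown."""
    return non_english_ratio(text, vocab) < threshold


# Tiny stand-in vocabulary for the demo; in practice this would be built once
# from nltk.corpus.words.words() and reused across calls.
demo_vocab = {'this', 'is', 'a', 'test'}
print(is_english('This is a test, fahrrad.', demo_vocab))
print(is_english('Das ist ein Fahrrad', demo_vocab))
```

The threshold is a judgment call: short texts with a few names or typos can easily exceed a strict cutoff, so it may need tuning for the data at hand.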