@janithl
Last active December 18, 2016 06:23
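from collections import defaultdict

# range() excludes its end value, so each end is one past the block's last code point.
UNICODE_BLOCKS = {
    'en': range(0x0000, 0x02B0),  # Basic Latin through IPA Extensions
    'si': range(0x0D80, 0x0E00),  # Sinhala
    'ta': range(0x0B80, 0x0C00),  # Tamil
    'dv': range(0x0780, 0x07C0),  # Thaana (Dhivehi)
}


def getlang(text):
    """Get language via Unicode range. Partially based on:
    https://github.com/kent37/guess-language/blob/master/guess_language/guess_language.py#L344
    """
    run_types = defaultdict(int)
    for c in text:
        if c.isalpha():
            for block in UNICODE_BLOCKS:
                if ord(c) in UNICODE_BLOCKS[block]:
                    run_types[block] += 1
    # max() raises ValueError on an empty dict, so return None when no letter matched.
    if not run_types:
        return None
    return max(run_types, key=run_types.get)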
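
A minimal, self-contained sketch of how the function might be exercised. The sample strings and the `SAMPLES` name are illustrative assumptions, not data from the gist or its linked test file. The ranges here use one-past-the-end values, and returning `None` for text without matching letters is likewise an assumption about the desired behaviour.

```python
from collections import defaultdict

# Half-open ranges: each end value is one past the block's last code point.
UNICODE_BLOCKS = {
    'en': range(0x0000, 0x02B0),
    'si': range(0x0D80, 0x0E00),
    'ta': range(0x0B80, 0x0C00),
    'dv': range(0x0780, 0x07C0),
}


def getlang(text):
    """Return the key of the block holding the most letters of `text`."""
    counts = defaultdict(int)
    for c in text:
        # Combining vowel signs are not alphabetic, so only base letters count.
        if c.isalpha():
            for block, code_points in UNICODE_BLOCKS.items():
                if ord(c) in code_points:
                    counts[block] += 1
    # Assumption: report None instead of raising when nothing matched.
    return max(counts, key=counts.get) if counts else None


# Hypothetical sample inputs, one per supported script.
SAMPLES = {
    'en': "Hello world",
    'si': "ආයුබෝවන්",
    'ta': "வணக்கம்",
    'dv': "ދިވެހި",
}

for expected, sample in SAMPLES.items():
    print(expected, getlang(sample))
```

Because the function only counts letters per block, mixed-script text resolves to whichever script contributes the most alphabetic characters.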
@pathumego

nice :)

@kiriappeee

Can you add some sample data for this? I just want to try an overthought optimization.

@janithl

janithl commented Dec 18, 2016

@kiriappeee Hey man, sorry I just saw this! Made a quick and incomplete test file. https://gist.github.com/janithl/bdc5d0470e024cc284fb777c92081428
