santhoshtr/lid.md

Created June 23, 2023 06:31

Language identification - notes

An Open Dataset and Model for Language Identification

From https://github.com/laurieburchell/open-lid-dataset

Paper: https://arxiv.org/pdf/2305.13820.pdf

Model: https://data.statmt.org/lid/lid201-model.bin.gz (licensed under GPLv3)
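The model link above points to a gzip archive (`.bin.gz`), so it presumably has to be decompressed before fastText can load it. A minimal sketch using only the standard library; the `gunzip` helper and the local file names are illustrative, not part of the OpenLID release:

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str, dst: str) -> Path:
    """Decompress the gzip file at `src` into `dst` and return the output path."""
    # Stream the data in chunks so a large model file is never held in memory at once.
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return Path(dst)
```

Assuming the archive was saved as `lid201-model.bin.gz`, calling `gunzip("lid201-model.bin.gz", "lid201-model.bin")` should produce a file that can be passed to `fasttext.load_model`.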

NLLB LID model for 218 languages

wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

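Then use it for inference:

import fasttext

# Path to the model file downloaded with wget above
pretrained_lang_model = "lid218e.bin"
model = fasttext.load_model(pretrained_lang_model)

text = "これ、浅草に、行きますか"
# predict() returns a (labels, probabilities) pair; k=1 keeps only the top label
predictions = model.predict(text, k=1)
print(predictions)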

License is CC-BY-NC
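The raw prediction is a pair of labels and probabilities, with labels carrying fastText's `__label__` prefix (for the NLLB model, something like `__label__jpn_Jpan`). fastText's `predict` also rejects input containing newlines. A small sketch that handles both; `parse_label` and `identify` are hypothetical helper names, and splitting the label into language and script assumes the `lang_Script` label format:

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_label(label: str) -> Tuple[str, str]:
    """Split a label such as '__label__jpn_Jpan' into ('jpn', 'Jpan')."""
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    # Assumption: labels look like '<language>_<Script>'; script is '' if absent.
    lang, _, script = label.partition("_")
    return lang, script


def identify(model, text: str, k: int = 1) -> List[Tuple[str, str, float]]:
    """Return up to k (language, script, probability) guesses for `text`."""
    # predict() handles one line at a time, so collapse newlines and extra whitespace.
    clean = " ".join(text.split())
    labels, probs = model.predict(clean, k=k)
    return [parse_label(l) + (float(p),) for l, p in zip(labels, probs)]
```

With the model loaded as above, `identify(model, text)` would give a list of tuples instead of the nested raw output.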

Links

https://github.com/slone-nlp/myv-nmt/blob/main/dirty-code-2022/model_training/01_multilang-detect.ipynb
https://huggingface.co/slone/fastText-LID-323
https://fasttext.cc/blog/2017/10/02/blog-post.html
https://fasttext.cc/docs/en/supervised-tutorial.html
facebookresearch/fastText#1323 Any plans to update the pre-trained model for Language Identification?
