From https://github.com/laurieburchell/open-lid-dataset
Paper: https://arxiv.org/pdf/2305.13820.pdf
Model: https://data.statmt.org/lid/lid201-model.bin.gz (licensed under GPLv3)
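The OpenLID model above is distributed as a gzip archive, and as far as I know `fasttext.load_model` expects an uncompressed `.bin` file. A minimal sketch of the decompression step using only the standard library; the helper name `decompress_gzip` and the local file names are assumptions for illustration, not part of the OpenLID repository:

```python
import gzip
import shutil


def decompress_gzip(src_path: str, dst_path: str) -> None:
    """Stream-decompress a .gz file to dst_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Hypothetical usage, assuming the archive was already downloaded
# into the current directory:
# decompress_gzip("lid201-model.bin.gz", "lid201-model.bin")
```

Running `gunzip lid201-model.bin.gz` in a shell should give the same result.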
wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
Then use it for inference:
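import fasttext

# Path to the model file downloaded above
pretrained_lang_model = "lid218e.bin"
model = fasttext.load_model(pretrained_lang_model)

# The input must be a single line (no newline characters)
text = "これ、浅草に、行きますか"
# k=1 returns only the top-scoring label
predictions = model.predict(text, k=1)
print(predictions)  # a (labels, probabilities) pair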
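`model.predict` returns a pair of labels and probabilities, with each label carrying a `__label__` prefix. For lid218e the labels follow a language_script pattern such as `__label__jpn_Jpan`. A small sketch of post-processing that output; `clean_text` and `parse_prediction` are hypothetical helpers, and the sample label and score in the comment are illustrative values, not real model output:

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"


def clean_text(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces.

    fastText's predict() handles one line at a time and rejects
    input that contains a newline.
    """
    return " ".join(text.split())


def parse_prediction(
    prediction: Tuple[Sequence[str], Sequence[float]],
) -> List[Tuple[str, str, float]]:
    """Turn a (labels, probabilities) pair into (language, script, probability) tuples."""
    labels, probs = prediction
    results = []
    for label, prob in zip(labels, probs):
        # Drop the "__label__" prefix if present
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        # Split "jpn_Jpan" into language "jpn" and script "Jpan";
        # script is "" when the label has no underscore
        lang, _, script = code.partition("_")
        results.append((lang, script, float(prob)))
    return results


# Illustrative input shaped like predict() output (values are made up):
# parse_prediction((("__label__jpn_Jpan",), [0.98]))
```

With a loaded model, this would be called as `parse_prediction(model.predict(clean_text(text), k=3))` to get the top three candidates.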
The lid218e model is licensed under CC-BY-NC.
- https://github.com/slone-nlp/myv-nmt/blob/main/dirty-code-2022/model_training/01_multilang-detect.ipynb
- https://huggingface.co/slone/fastText-LID-323
- https://fasttext.cc/blog/2017/10/02/blog-post.html
- https://fasttext.cc/docs/en/supervised-tutorial.html
- facebookresearch/fastText#1323 Any plans to update the pre-trained model for Language Identification?