from gensim.models.fasttext import FastText

ft_model = FastText(
    sg=1,         # use skip-gram: usually gives better results
    size=100,     # embedding dimension (default)
    window=10,    # window size: 10 tokens before and 10 tokens after to get wider context
    min_count=5,  # only consider tokens with at least n occurrences in the corpus
    negative=15,  # negative sampling: bigger than default to sample negative examples more often
    min_n=2,      # min character n-gram length
    max_n=5       # max character n-gram length
)

ft_model.build_vocab(tok_text)  # tok_text is our tokenized input text - a list of lists of tokens, one inner list per document

ft_model.train(
    tok_text,
    epochs=6,
    total_examples=ft_model.corpus_count,
    total_words=ft_model.corpus_total_words)

ft_model.save('_fasttext.model')             # save
ft_model = FastText.load('_fasttext.model')  # load
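Once trained, the model can be queried through its wv attribute. A minimal sketch of how that might look; 'river' and 'riverbank' are hypothetical tokens standing in for words from your own corpus:

# Look up the nearest neighbours of an in-vocabulary word
print(ft_model.wv.most_similar('river', topn=5))

# Because FastText composes vectors from character n-grams, it can also
# return a vector for a word it never saw during training
oov_vector = ft_model.wv['riverbank']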
It looks like in newer versions of gensim (4.0+), this may need to be updated to read vector_size=100 instead of size=100.
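For reference, a minimal sketch of the same instantiation under gensim 4.x, where size was renamed to vector_size (the remaining arguments keep their names):

from gensim.models.fasttext import FastText

# Assuming gensim >= 4.0: `size` became `vector_size`; other parameters are unchanged
ft_model = FastText(
    sg=1,
    vector_size=100,
    window=10,
    min_count=5,
    negative=15,
    min_n=2,
    max_n=5
)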