@joshua-taylor
Last active March 8, 2022 18:56
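The snippet below assumes tok_text, a tokenized corpus structured as a list of documents, each itself a list of token strings. A minimal hypothetical example of that shape (the actual corpus and tokenizer are not part of this gist):

tok_text = [
    ['the', 'quick', 'brown', 'fox'],
    ['fasttext', 'builds', 'subword', 'embeddings'],
]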
from gensim.models.fasttext import FastText

ft_model = FastText(
    sg=1,         # use skip-gram: usually gives better results
    size=100,     # embedding dimension (the default)
    window=10,    # window size: 10 tokens before and 10 tokens after, for wider context
    min_count=5,  # only consider tokens with at least n occurrences in the corpus
    negative=15,  # negative sampling: draw more negative examples than the default of 5
    min_n=2,      # min character n-gram length
    max_n=5       # max character n-gram length
)

ft_model.build_vocab(tok_text)  # tok_text is our tokenized input text: a list of documents, each a list of tokens

ft_model.train(
    tok_text,
    epochs=6,
    total_examples=ft_model.corpus_count,
    total_words=ft_model.corpus_total_words)

ft_model.save('_fasttext.model')             # save
ft_model = FastText.load('_fasttext.model')  # load
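Once trained, the model can be queried for word vectors and nearest neighbours. A minimal sketch; 'fishing' is just a hypothetical query term here, not a word from any particular corpus:

# look up the embedding for a single token (FastText can also compose vectors
# for out-of-vocabulary words from their character n-grams)
vec = ft_model.wv['fishing']  # numpy array of length 100

# find the most similar tokens in the vocabulary
print(ft_model.wv.most_similar('fishing', topn=5))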
@bigfoot504

It looks like in newer versions of gensim's FastText, this may need to be updated to read vector_size=100 instead of size=100.
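For reference, a minimal sketch of the same constructor call under that assumption (gensim 4.x, where the parameter was renamed; all other arguments unchanged):

ft_model = FastText(
    sg=1,
    vector_size=100,  # renamed from size in gensim 4.x
    window=10,
    min_count=5,
    negative=15,
    min_n=2,
    max_n=5
)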
