@gaphex
Created June 23, 2019 12:06
generate embeddings for articles from the Reuters news corpus
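import nltk
import numpy as np
from nltk.corpus import reuters

# Fetch the Reuters corpus and the Punkt tokenizer data.
nltk.download("reuters")
nltk.download("punkt")

# NOTE: bert_vectorizer is not defined in this snippet; it must already be
# in scope as a callable that maps a list of strings to a 2-D array.

max_samples = 256  # cap on sentences taken per category
categories = ['wheat', 'tea', 'strategic-metal',
              'housing', 'money-supply', 'fuel']

S, X, Y = [], [], []  # sentences, embedding blocks, category labels
for category in categories:
    print(category)
    sents = reuters.sents(categories=category)
    sents = [' '.join(sent) for sent in sents][:max_samples]
    X.append(bert_vectorizer(sents, verbose=True))
    Y += [category] * len(sents)
    S += sents

X = np.vstack(X)  # stack the per-category blocks into one matrix
X.shape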
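
The snippet above depends on `bert_vectorizer`, which is defined elsewhere. To smoke-test the collection loop without loading a BERT model, a stand-in with the same call shape can be swapped in. The sketch below is an assumption for illustration only: `stub_vectorizer`, its `dim=768` default, and the toy sentences are hypothetical and not part of the original gist, and the vectors it returns are hash-seeded random numbers, not real embeddings.

```python
import hashlib

import numpy as np


def stub_vectorizer(sentences, dim=768, verbose=False):
    """Hypothetical stand-in for bert_vectorizer (NOT real embeddings).

    Returns one deterministic pseudo-random row per input sentence.
    """
    rows = []
    for sentence in sentences:
        # Seed from a stable hash so the same text always yields the same row.
        digest = hashlib.sha256(sentence.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:4], "little")
        rows.append(np.random.default_rng(seed).standard_normal(dim))
    if verbose:
        print(f"vectorized {len(rows)} sentences")
    # reshape keeps the result 2-D even when the input list is empty.
    return np.array(rows, dtype=np.float32).reshape(len(rows), dim)


# Toy data in place of reuters.sents(...), same loop structure as above.
toy_groups = {
    "wheat": ["Wheat prices rose .", "Exports fell ."],
    "tea": ["Tea output was up ."],
}

S_demo, X_demo, Y_demo = [], [], []
for category, sents in toy_groups.items():
    X_demo.append(stub_vectorizer(sents))
    Y_demo += [category] * len(sents)
    S_demo += sents

X_demo = np.vstack(X_demo)
```

After the loop, `X_demo` has one row per entry in `S_demo` and `Y_demo`, which is the alignment the original loop relies on when `X`, `Y`, and `S` are used together later.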