@MLWhiz
Created February 9, 2019 08:02
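import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# train_df and test_df are assumed to be pandas DataFrames with a cleaned_text column.
# Count word unigrams through trigrams; drop n-grams seen in fewer than 3 documents.
cnt_vectorizer = CountVectorizer(dtype=np.float32, strip_accents='unicode',
                                 analyzer='word', token_pattern=r'\w{1,}',
                                 ngram_range=(1, 3), min_df=3)
# Fit on train and test text together so the vocabulary covers n-grams from both.
cnt_vectorizer.fit(list(train_df.cleaned_text.values) + list(test_df.cleaned_text.values))
# Transform each split into a sparse document-term matrix of n-gram counts.
xtrain_cntv = cnt_vectorizer.transform(train_df.cleaned_text.values)
xtest_cntv = cnt_vectorizer.transform(test_df.cleaned_text.values)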
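
The snippet above depends on train_df and test_df, which are not defined here. As a rough, self-contained sketch of the same steps, the example below uses two small hypothetical lists (train_texts and test_texts, invented for illustration) in place of the DataFrame columns, with the same CountVectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for train_df.cleaned_text / test_df.cleaned_text.
train_texts = ["good movie", "good plot", "bad movie"]
test_texts = ["good movie", "bad plot"]

# Same settings as the gist: word 1- to 3-grams, kept only if they
# appear in at least 3 documents of the fitted corpus.
vec = CountVectorizer(dtype=np.float32, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), min_df=3)

# Fit on the combined text, then transform each split separately.
vec.fit(train_texts + test_texts)
xtrain = vec.transform(train_texts)
xtest = vec.transform(test_texts)

vocab = sorted(vec.vocabulary_)
print(vocab)                       # n-grams that survived min_df
print(xtrain.shape, xtest.shape)   # (n_documents, n_kept_ngrams)
print(xtrain.toarray())
```

With data this small, min_df=3 discards most n-grams, so only terms that occur in three or more of the five combined documents remain as columns. Both matrices share the same columns because they come from the same fitted vocabulary.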