
@dayyass
Created June 17, 2021 13:54
The sklearn tokenizer used in HashingVectorizer, CountVectorizer, and TfidfVectorizer.
import re


# Method build_tokenizer from the _VectorizerMixin mixin, from which HashingVectorizer,
# CountVectorizer, and TfidfVectorizer (through CountVectorizer) inherit.
# It is used to split a string into a sequence of tokens (only if analyzer == 'word').
def build_tokenizer(token_pattern: str = r"(?u)\b\w\w+\b"):
    """
    Return a function that splits a string into a sequence of tokens.

    Returns
    -------
    tokenizer: callable
        A function to split a string into a sequence of tokens.
    """
    return re.compile(token_pattern).findall


sentence = "This is sentence example."
tokenizer = build_tokenizer()
tokenizer(sentence)  # ['This', 'is', 'sentence', 'example']
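Note that the default token_pattern drops punctuation and single-character tokens, since it requires at least two word characters; passing a different pattern changes that. A short sketch of both behaviors (the example sentence and the alternative pattern are mine, not from sklearn's defaults):

```python
import re


def build_tokenizer(token_pattern: str = r"(?u)\b\w\w+\b"):
    # Same helper as above: compile the pattern once and return its findall method.
    return re.compile(token_pattern).findall


# Default pattern: only tokens of two or more word characters survive.
default_tokenizer = build_tokenizer()
print(default_tokenizer("I am a cat, obviously!"))  # ['am', 'cat', 'obviously']

# Custom pattern that also keeps single-character tokens.
unigram_tokenizer = build_tokenizer(r"(?u)\b\w+\b")
print(unigram_tokenizer("I am a cat, obviously!"))  # ['I', 'am', 'a', 'cat', 'obviously']
```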