Created
June 17, 2021 13:54
The sklearn tokenizer used in HashingVectorizer, CountVectorizer and TfidfVectorizer.
import re


# Method build_tokenizer from the _VectorizerMixin mixin, from which
# HashingVectorizer, CountVectorizer and TfidfVectorizer (through
# CountVectorizer) partially inherit.
# It is used to split a string into a sequence of tokens (only if analyzer == 'word').
def build_tokenizer(token_pattern: str = r"(?u)\b\w\w+\b"):
    """
    Return a function that splits a string into a sequence of tokens.

    Returns
    -------
    tokenizer: callable
        A function to split a string into a sequence of tokens.
    """
    return re.compile(token_pattern).findall


sentence = "This is sentence example."
tokenizer = build_tokenizer()
tokenizer(sentence)  # ['This', 'is', 'sentence', 'example']
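A few extra calls (a sketch, not part of the original gist) make the behavior of the default pattern `r"(?u)\b\w\w+\b"` concrete: tokens need at least two word characters, punctuation never joins a token, and the `(?u)` flag keeps `\w` Unicode-aware.

```python
import re

# Same default pattern as sklearn's build_tokenizer:
# two or more word characters between word boundaries, Unicode-aware.
tokenizer = re.compile(r"(?u)\b\w\w+\b").findall

# Single-character tokens ("I", "a") are dropped by the \w\w+ requirement.
print(tokenizer("I am a dev"))        # ['am', 'dev']

# Punctuation is never part of a token, so hyphenated words split apart.
print(tokenizer("state-of-the-art"))  # ['state', 'of', 'the', 'art']

# (?u) makes \w match Unicode letters as well as digits.
print(tokenizer("naïve café 42"))     # ['naïve', 'café', '42']
```

Note that dropping one-character tokens is a deliberate sklearn default; pass a custom `token_pattern` such as `r"(?u)\b\w+\b"` to keep them.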