Analyzer | Description | Example Use |
---|---|---|
Standard | Default; breaks text by word boundaries, removes most punctuation, lowercases tokens. | English prose, general search |
Simple | Splits on any non-letter character (digits included) and lowercases tokens. | Alphabetic text where digits and punctuation can be discarded |
Whitespace | Splits on whitespace only, preserves case. | Code, serial numbers |
Keyword | Does not split; treats entire text as a single token. | Exact match fields, IDs, tags |
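
To make the differences in the table concrete, here is a minimal Python sketch that imitates each analyzer with regular expressions. These are rough approximations for illustration only, not Lucene's actual tokenizers (the real Standard analyzer follows Unicode text segmentation rules), and the function names are hypothetical.

```python
import re

TEXT = "Part-No X9 ships"


def standard_like(text):
    # Rough stand-in for Standard: runs of word characters, lowercased.
    return re.findall(r"\w+", text.lower())


def simple_like(text):
    # Simple: runs of letters only, lowercased; digits act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())


def whitespace_like(text):
    # Whitespace: split on whitespace, case and punctuation preserved.
    return text.split()


def keyword_like(text):
    # Keyword: the whole input is one token.
    return [text]


for analyzer in (standard_like, simple_like, whitespace_like, keyword_like):
    print(analyzer.__name__, analyzer(TEXT))
# standard_like   ['part', 'no', 'x9', 'ships']
# simple_like     ['part', 'no', 'x', 'ships']
# whitespace_like ['Part-No', 'X9', 'ships']
# keyword_like    ['Part-No X9 ships']
```

Note how the letter-only split loses the digit in `X9`, while the whitespace and keyword variants keep identifiers intact.
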

Feature | TF-IDF | BM25 |
---|---|---|
Full Form | Term Frequency-Inverse Document Frequency | Best Matching 25 |
Default in Elasticsearch | Yes (before v5.0) | Yes (v5.0 and later) |
Term Frequency Handling | Linear | Saturated (diminishing returns) |
Document Length Normalization | Minimal | Tunable and robust |
Tunable Parameters | No | Yes (k1, b) |

Use Case | TF-IDF | BM25 |
---|---|---|
Simple scoring model | Yes | No |
Accurate relevance for modern search | No | Yes |
Normalize for document length | No | Yes |
Tune scoring behavior with parameters | No | Yes |
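
The contrast between linear and saturated term frequency can be sketched in a few lines of Python. This uses the textbook BM25 term-frequency component with k1 = 1.2 and b = 0.75 (Elasticsearch's defaults) and leaves out the IDF factor; Lucene's actual implementation differs in details, so treat the function names and numbers as an illustration rather than the engine's exact scores.

```python
K1 = 1.2   # controls how quickly term frequency saturates
B = 0.75   # controls how strongly document length is normalized


def linear_tf(tf):
    # Plain TF weighting: every extra occurrence adds the same amount.
    return float(tf)


def bm25_tf(tf, doc_len, avg_doc_len, k1=K1, b=B):
    # Textbook BM25 term-frequency component (IDF omitted).
    # length_norm is > 1 for longer-than-average documents, < 1 for shorter.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


for tf in (1, 10, 100):
    print(tf, linear_tf(tf), round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))
# 1 1.0 1.0
# 10 10.0 1.964
# 100 100.0 2.174
```

For an average-length document the BM25 component never exceeds k1 + 1 = 2.2, however often the term repeats, and the same term count in a longer-than-average document scores lower, which is the effect of b.
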
Notes for software engineering meeting presentation
- Mid to late 2000s: appearance of document stores / NoSQL databases such as MongoDB and CouchDB
- Relational DBs now support document data: JSON in MySQL, JSON and JSONB in PostgreSQL
- Focus on JSONB in PostgreSQL (the most full-featured)

import phonenumbers
from pydantic.validators import strict_str_validator


class PhoneNumber(str):
    """Phone Number Pydantic type, using Google's phonenumbers"""

    @classmethod
    def __get_validators__(cls):
        yield strict_str_validator
        yield cls.validate