@yeiichi
Created March 30, 2026 09:05
Remove whitespace from text using different strategies
import re


class WhitespaceRemover:
    """Remove whitespace from text using different strategies.

    All methods remove *Unicode whitespace* (not just ASCII spaces).
    Choose a method based on readability, performance, and workload size.
    """

    _WHITESPACE_RE = re.compile(r"\s+")
    _TRANSLATION_TABLE = None
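
    @classmethod
    def _get_translation_table(cls) -> dict[int, None]:
        # Built lazily on first use, then cached on the class.
        # Maps every code point for which str.isspace() is True to None (delete).
        if cls._TRANSLATION_TABLE is None:
            cls._TRANSLATION_TABLE = {
                codepoint: None
                for codepoint in range(0x110000)
                if chr(codepoint).isspace()
            }
        return cls._TRANSLATION_TABLE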

    @staticmethod
    def simple(text: str) -> str:
        """Remove whitespace using a Pythonic character filter.

        Characteristics:
        - Most readable and explicit implementation.
        - Uses str.isspace(), so it correctly handles all Unicode whitespace,
          including IDEOGRAPHIC SPACE (U+3000), non-breaking spaces, etc.
        - No setup overhead.

        Trade-offs:
        - Slower than regex/translate for large strings or heavy batch processing.

        Recommended for:
        - Small to medium strings.
        - Situations where clarity and maintainability are prioritized.
        - One-off transformations or scripts.
        """
        return "".join(char for char in text if not char.isspace())

    @classmethod
    def regex(cls, text: str) -> str:
        """Remove whitespace using a precompiled regular expression.

        Characteristics:
        - Uses '\\s+' which matches Unicode whitespace in Python.
        - Faster than pure Python loops for most real-world inputs.
        - Concise and widely understood.

        Trade-offs:
        - Slight overhead from regex engine.
        - Less explicit than the `simple` method.

        Recommended for:
        - General-purpose usage (best balance of speed and readability).
        - Medium to large strings.
        - Batch processing where performance matters but simplicity is still desired.
        """
        return cls._WHITESPACE_RE.sub("", text)

    @classmethod
    def translate(cls, text: str) -> str:
        """Remove whitespace using str.translate() with a Unicode map.

        Characteristics:
        - Often the fastest approach for large-scale or repeated processing;
          benchmark on representative data before relying on this.
        - Operates at C level via str.translate().
        - Removes all characters where str.isspace() is True,
          including IDEOGRAPHIC SPACE (U+3000) and other Unicode whitespace.

        Trade-offs:
        - Building the translation table scans every Unicode code point
          (about 1.1M) once; the resulting table holds only the whitespace
          code points, so it is small.
        - One-time initialization cost on the first call (lazy-loaded and
          cached here).
        - Less intuitive than other methods.

        Recommended for:
        - High-throughput pipelines.
        - Processing very large strings or large batches.
        - Performance-critical applications where startup cost is acceptable.
        """
        return text.translate(cls._get_translation_table())