Created
June 4, 2018 13:27
Regex snippet for normalising whitespace and newlines. Source: http://textacy.readthedocs.io/en/latest/_modules/textacy/preprocess.html#normalize_whitespace
import re

LINEBREAK_REGEX = re.compile(r'((\r\n)|[\n\v])+')
NONBREAKING_SPACE_REGEX = re.compile(r'(?!\n)\s+')

def normalize_whitespace(text):
    """
    Given ``text`` str, replace one or more spacings with a single space, and one
    or more linebreaks with a single newline. Also strip leading/trailing whitespace.
    """
    return NONBREAKING_SPACE_REGEX.sub(' ', LINEBREAK_REGEX.sub(r'\n', text)).strip()
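
A quick usage sketch for the snippet above, repeated here so the example runs on its own. One thing to be aware of: the lookahead in NONBREAKING_SPACE_REGEX only checks the first character of a match, so a whitespace run that starts with a space and then contains a newline seems to collapse into a single space, newline included. The `STRICT_SPACE_REGEX` pattern and `normalize_whitespace_strict` function below are hypothetical names, not part of the textacy source; they are one possible way to keep such newlines, at the cost of leaving single spaces around them.

```python
import re

# Same patterns and function as in the snippet above.
LINEBREAK_REGEX = re.compile(r'((\r\n)|[\n\v])+')
NONBREAKING_SPACE_REGEX = re.compile(r'(?!\n)\s+')

def normalize_whitespace(text):
    return NONBREAKING_SPACE_REGEX.sub(' ', LINEBREAK_REGEX.sub(r'\n', text)).strip()

# Assumption, not from the source: match any whitespace character except "\n",
# so a newline is never absorbed into a run of spaces.
STRICT_SPACE_REGEX = re.compile(r'[^\S\n]+')

def normalize_whitespace_strict(text):
    return STRICT_SPACE_REGEX.sub(' ', LINEBREAK_REGEX.sub('\n', text)).strip()

sample = "  Hello   world\r\n\r\n\nsecond\tline  "
print(repr(normalize_whitespace(sample)))           # 'Hello world\nsecond line'

# A space directly before a newline: the original pattern drops the newline.
print(repr(normalize_whitespace("a \n b")))         # 'a b'
print(repr(normalize_whitespace_strict("a \n b")))  # 'a \n b'
```

If the stray spaces next to newlines matter, a follow-up pass that strips each line would be needed; that step is left out here to keep the sketch close to the original.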