Last active
October 22, 2021 13:55
A collection of useful regular expressions, in PCRE and Python syntax
# Garbage matcher
# matches any string consisting only of |, <, >, +, *, ^, #, =, ~, chains of hyphens or underscores, or runs of two or more \, _ or /.
# this identifies patterns like ++<===> ######## <---->^^ which serve no purpose but to decorate the text
/(\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|\/){2,}/
# Garbage matcher 2
# matches any single character that isn't a letter, digit, space or basic punctuation (, . ' ? !).
# this is typically useful for cleaning up emojis
/.(?<!([a-zA-Z0-9]|,|\.|'| |\?|\!))/
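As a rough illustration of the emoji clean-up described above, the same pattern can be run through Python's `re` module. This is only a sketch: the sample sentence and the `strip_unusual` helper are made up for the example and are not part of the gist.

```python
import re

# Same pattern as Garbage matcher 2, written as a Python raw string.
PAT_GARBAGE_2 = r".(?<!([a-zA-Z0-9]|,|\.|'| |\?|\!))"

def strip_unusual(text):
    """Hypothetical helper: delete every character the pattern matches."""
    return re.sub(PAT_GARBAGE_2, "", text)

sample = "Great job! 🎉👍 See you, ok?"
cleaned = strip_unusual(sample)
print(cleaned)
```

Because `.` does not match a newline by default, line breaks survive this clean-up; only the unlisted characters on each line are removed.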
# layout hyphenation matcher, typically matches hyphens used as bullet points
/(^\ *-\ )/
# e-mail address matcher
# the theoretical character limit for top-level domains is 63 characters
# only lowercase letters are listed, so use the i flag to match addresses containing uppercase letters
/([a-z]|[0-9]|\.|\+|-)+@([a-z]|[0-9]|\.|-)+\.[a-z]{2,63}/
# phone number matcher
# see this Stack Overflow post explaining how many digits to account for in a phone number
# https://stackoverflow.com/a/4729239/2980717
/\(?\+?(\(?[0-9]\)?(-|\ |\.)?){6,30}[0-9]/
# URL matcher
/https?\:\/\/(www\.)?([A-Za-z]|[0-9]|\.|-|%)+\.[A-Za-z]{2,63}(\/([A-Za-z]|[0-9]|-|\.|_|#)+)*\/?(\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)?/
# https?\:\/\/(www\.)? # scheme and optional www.
# ([A-Za-z]|[0-9]|\.|-|%)+ # address
# \.[A-Za-z]{2,63} # top-level domain
# (\/([A-Za-z]|[0-9]|-|\.|_|#)+)*\/? # /stuff/between/slashes/
# (\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)? # request details after ?
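The piece-by-piece breakdown of the URL matcher maps naturally onto Python's `re.VERBOSE` mode, where whitespace and `#` comments are allowed inside the pattern. The following is a sketch rather than the gist's own code: the name `URL_VERBOSE` is hypothetical, the alternations are folded into character classes, and the letter class is written as `[A-Za-z]`; all three are choices made for this example.

```python
import re

# Hypothetical verbose rewrite of the URL matcher (not the gist's original).
# In VERBOSE mode a bare # starts a comment, so # is escaped inside the classes.
URL_VERBOSE = re.compile(r"""
    https?://(www\.)?              # scheme and optional www.
    [A-Za-z0-9.%-]+                # address
    \.[A-Za-z]{2,63}               # top-level domain
    (/[A-Za-z0-9._\#-]+)*/?        # /stuff/between/slashes/
    (\?[A-Za-z0-9.%=&_\#:+-]+)?    # request details after ?
""", re.VERBOSE)

text = "see https://www.example.com/a/b_c?x=1&y=2 for details"
match = URL_VERBOSE.search(text)
found = match.group(0) if match else None
print(found)
```

Keeping one comment per fragment means the explanation cannot drift away from the pattern it describes, which is the main reason to prefer the verbose form for long expressions.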
# PAT at the beginning of each regex name stands for Pattern
PAT_GARBAGE = r"((\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|/){2,})"
PAT_GARBAGE_2 = r".(?<!([a-zA-Z0-9]|,|\.|'| |\?|\!))"
PAT_BULLET_HYPHENS = r"(^\s*-\ )"
PAT_EMAIL = r"(([A-Za-z]|[0-9]|\.|\+|_|-)+@([A-Za-z]|[0-9]|\.|-)+\.[A-Za-z]{2,63})"
PAT_PHONE = r"(\(?\+?(\(?[0-9]\)?(-|\ |\.)?){6,30}[0-9])"
PAT_URL = r"(https?\://(www\.)?([A-Za-z]|[0-9]|\.|-|%)+\.[A-Za-z]{2,63}(/([A-Za-z]|[0-9]|-|\.|_|#)+)*/?(\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)?)"
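One possible way to apply these patterns is to chain `re.sub` calls into a small clean-up step. The sketch below copies two of the patterns so that it runs on its own; the `clean` function and the sample text are assumptions made for illustration, not part of the gist. Note that `PAT_BULLET_HYPHENS` relies on `^`, so it needs `re.MULTILINE` to catch bullets on lines after the first.

```python
import re

# Copies of two patterns from the list above, so the example is self-contained.
PAT_GARBAGE = r"((\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|/){2,})"
PAT_BULLET_HYPHENS = r"(^\s*-\ )"

def clean(text):
    """Hypothetical helper: drop bullet hyphens, then decorative symbol runs."""
    # re.MULTILINE lets ^ match at the start of every line, not just the first.
    text = re.sub(PAT_BULLET_HYPHENS, "", text, flags=re.MULTILINE)
    # Remove decoration such as ++<===> or <---->.
    text = re.sub(PAT_GARBAGE, "", text)
    return text

sample = "++<===> Shopping list <---->\n- eggs\n  - milk"
cleaned = clean(sample)
print(repr(cleaned))
```

The bullet pass runs first on purpose: `PAT_GARBAGE` only removes chains of two or more hyphens, so a single leading bullet hyphen would otherwise be left behind.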
This code is released under the .