@luisenriquecorona
Created February 8, 2020 03:00
A regular expression compiled from a bytes pattern matches \d and \w against ASCII characters only; the same patterns compiled from a str also match Unicode digits and letters beyond ASCII.
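import re

# The same two patterns, compiled once from str and once from bytes.
re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')

# Sample text: Tamil digits for 1729, then ASCII digits and superscript threes.
text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
            " as 1729 = 1³ + 12³ = 9³ + 10³.")
# bytes patterns can only search bytes, so encode the text first.
text_bytes = text_str.encode('utf_8')

print('Text', repr(text_str), sep='\n ')
print('Numbers')
# The str pattern matches the Tamil and the ASCII digits; the bytes pattern only the ASCII ones.
print(' str :', re_numbers_str.findall(text_str))
print(' bytes:', re_numbers_bytes.findall(text_bytes))
print('Words')
# The str pattern also matches letters, superscripts, and Tamil digits; the bytes pattern only ASCII word characters.
print(' str :', re_words_str.findall(text_str))
print(' bytes:', re_words_bytes.findall(text_bytes))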
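
A related point that the script does not show: a str pattern can be limited to ASCII-only matching by passing the re.ASCII flag, which makes \d and \w behave as they do in a bytes pattern. The following is a minimal sketch that reuses the sample sentence; the variable names are illustrative and not part of the original gist.

```python
import re

# Same sample sentence as in the gist: Tamil digits, ASCII digits, superscripts.
text = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
        " as 1729 = 1³ + 12³ = 9³ + 10³.")

# Default str pattern: \d matches any Unicode decimal digit.
unicode_digits = re.compile(r'\d+')
# Same pattern with re.ASCII: \d is restricted to [0-9].
ascii_digits = re.compile(r'\d+', re.ASCII)
# bytes pattern, for comparison: always ASCII-only.
bytes_digits = re.compile(rb'\d+')

unicode_result = unicode_digits.findall(text)
ascii_result = ascii_digits.findall(text)
bytes_result = bytes_digits.findall(text.encode('utf_8'))

print('unicode:', unicode_result)
print('ascii  :', ascii_result)
print('bytes  :', bytes_result)
```

With the flag set, the str pattern should skip the Tamil digits and return the same runs of ASCII digits that the bytes pattern finds, only as str rather than bytes.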