Created
June 28, 2019 22:00
When are native literal strings safe?
Why do we have unadorned string literals (native strings) in our codebase?
Doesn't that put us in danger of UnicodeError exceptions?

(1) Your codebase should be using text by default. At the borders, you convert
    strings from other APIs into text and then use text throughout, only
    converting to bytes (or native strings) when those types are needed for
    another, outside API.

(2) On Python2, text can be safely combined with (or compared to) text [1]_. Bytes
    can be combined with bytes. And ascii-only bytes can be combined with text.

(3) On Python2, native strings are bytes so they follow the same rules as bytes:
    Safe to combine native strings with bytes. Only safe to combine ascii-only
    native strings with text.

(4) On Python3, text can be safely combined with text. Bytes can be combined
    with bytes. Bytes and text can **never** be safely combined without an
    explicit conversion of one value or the other.

(5) On Python3, native strings are text so they follow the same rules as text:
    Only safe to combine native strings with text.

If you understand all of the above, you'll find that the combinations that are
safe on both Python2 and Python3 are: text with text, bytes with bytes, and
**ascii-only** native strings with text. That last part is because native
strings are text on Python3 and ascii-only byte strings are safe to combine
with text on Python2.

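The Python3 rules in points (4) and (5) can be checked directly. This is a minimal sketch that assumes a Python3 interpreter; the helper name `try_combine` and the sample values are made up for illustration.

```python
def try_combine(left, right):
    """Return left + right, or the name of the exception the attempt raised."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__


text = u'caf\u00e9'             # text string
b_data = text.encode('utf-8')   # byte string
n_literal = 'menu: '            # ascii-only native literal; text on Python3

text_result = try_combine(n_literal, text)     # native + text: safe
bytes_result = try_combine(b'raw: ', b_data)   # bytes + bytes: safe
mixed_result = try_combine(b_data, text)       # bytes + text: TypeError on Python3
# An explicit conversion of one value makes the combination safe again
fixed_result = try_combine(b_data.decode('utf-8'), text)
```

Only the mixed case fails, and it fails regardless of whether the bytes are ascii-only, which is why the ascii-only allowance applies to native strings and not to byte strings.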
.. [1] "Combined with" includes `str.join()`, %-formatted strings, and
   concatenation with ``+``. `str.format()` needs to be understood to
   be used safely, though. The other methods will always convert the byte
   string to a text string using the ascii encoding. `str.format()` will
   convert its arguments to the type of string that it's a method of.

.. seealso:: https://anonbadger.wordpress.com/2016/01/05/python2-string-format-and-unicode/

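The footnote describes Python2. As a related illustration, the sketch below assumes a Python3 interpreter, where `str.format()` on a text string does not raise for a bytes argument but silently formats it with `str()`, so the output contains the repr of the bytes object. The variable names are made up for illustration.

```python
b_name = u'caf\u00e9'.encode('utf-8')

# On Python3, str.format() formats a bytes argument via str(), so the
# result embeds the repr of the bytes object instead of the decoded text.
wrong = u'name: {0}'.format(b_name)

# Converting to text at the border first gives the intended result.
right = u'name: {0}'.format(b_name.decode('utf-8'))
```

Here `wrong` ends up containing the characters `b'caf\xc3\xa9'` literally, which is easy to miss because no exception is raised.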
So, some examples:

This is safe to do::

    # pathname is a text string, converted at the border per point (1)
    filenames = ('/path/one', '/path/two')
    if pathname in filenames:
        print('We are inside a recognized directory')

Following our coding guidelines (point 1 in our list above), pathname contains
a text string. On Python2, the values in filenames are byte strings; because
they only contain ascii characters, they are safely converted to text strings
and then compared. On Python3, the values in filenames are already text
strings, so the comparison needs no conversion and is safe.

This is unsafe to do::

    import os

    filenames = os.listdir('.')
    if u'one' in filenames:
        print('Directory contains a recognized file')

In this example, filenames is getting native strings from an outside API.
We can't control whether there are non-ascii characters in the filenames there.
So when we check to see if u'one' is one of the filenames, we are in danger of
a UnicodeError on Python2. That's because each filename on Python2 is a byte
string. So, in the comparison, Python2 will attempt to convert it into a text
string to match u'one'. In doing so, it will use the ascii encoding. A
non-ascii filename cannot be decoded as ascii, so that conversion fails.

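One way to make the unsafe example follow point (1) is to convert the native strings to text at the border, before comparing. This is a minimal sketch: `to_text` and `directory_has` are hypothetical helpers, and decoding filenames as utf-8 is an assumption that will not hold on every system.

```python
import os
import tempfile


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as a text string."""
    if isinstance(value, bytes):
        # Python2 native strings and Python3 byte strings both land here
        return value.decode(encoding)
    return value


def directory_has(dirname, wanted):
    """Return True if the text string wanted names an entry in dirname."""
    n_filenames = os.listdir(dirname)  # native strings from an outside API
    filenames = [to_text(n_name) for n_name in n_filenames]
    return wanted in filenames


# Usage: create a scratch directory holding one file named 'one'
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'one'), 'w'):
    pass
found = directory_has(tmpdir, u'one')
```

After the conversion, the comparison is text with text on both Python2 and Python3, so no implicit ascii conversion takes place.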
So, similar to how we use a `b_` prefix when we want a variable to hold a byte
string, a variable which holds native strings needs to be prefixed with `n_`
when we can't rule out the variable holding non-ascii characters. In practice,
the easiest rule to follow is: if you're setting the variable to a string
literal which only contains ascii characters, you are safe. If you set the
variable to a string literal with non-ascii characters *or* you set the
variable to a native string from a function call, then the variable should be
prefixed with `n_` to warn that you have to think about the corner cases
when combining it with other non-native variables.
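
The naming rule can be sketched as follows. The variable names and the `is_ascii_only` check are hypothetical, written only to illustrate which values need a prefix; the convention itself is a matter of code review rather than something enforced at runtime.

```python
import os


def is_ascii_only(value):
    """Hypothetical check: True if value holds only ascii characters."""
    try:
        if isinstance(value, bytes):
            value.decode('ascii')   # byte strings (native strings on Python2)
        else:
            value.encode('ascii')   # text strings (native strings on Python3)
    except UnicodeError:
        return False
    return True


separator = '/'                           # ascii-only literal: no prefix needed
b_dirname = u'caf\u00e9'.encode('utf-8')  # byte string: b_ prefix
n_cwd = os.getcwd()                       # native string from a function call: n_ prefix

literal_is_safe = is_ascii_only(separator)
bytes_are_safe = is_ascii_only(b_dirname)
```

The value of `n_cwd` depends on where the code runs, which is exactly why it carries the `n_` prefix: nothing in the source guarantees that it is ascii-only.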