Operating on [UInt8] text buffers (“bytestrings”) is a common programming task. A popular approach among some Swift users is to (ab)use the String API, and attempt to spell familiar C idioms using its syntax. This has the major bonus of readability, but leaves users vulnerable to many pitfalls.
A common mistake is to convert bytestrings to Strings and compare them to other Strings. Given two bytestrings a: [UInt8], b: [UInt8], many users assume that
String(decoding: a, as: Unicode.ASCII.self) ==
String(decoding: b, as: Unicode.ASCII.self)
if and only if
a == b
but this doesn’t actually hold for all bytestrings. A real-world example of where this can cause harm is when detecting the magic header for the JPEG image format, ['ÿ', 'Ø', 'ÿ', 'Û'] ([0xFF, 0xD8, 0xFF, 0xDB]). For obvious choices of Unicode codec, it is possible for an entirely different bytestring to match it.
// none of these codepoints are actually ASCII, so `Unicode.ASCII`
// is clearly the wrong codec to use.
String(decoding: [0xFF, 0xD8, 0xFF, 0xDB], as: Unicode.ASCII.self) ==
String(decoding: [0xEE, 0xC7, 0xEE, 0xCA], as: Unicode.ASCII.self)
// true
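To see why, recall that String(decoding:as:) repairs invalid input with the U+FFFD replacement character, so every non-ASCII byte collapses to the same scalar. A minimal sketch of this (the variable names are illustrative only):
let jpeg:  [UInt8] = [0xFF, 0xD8, 0xFF, 0xDB]
let other: [UInt8] = [0xEE, 0xC7, 0xEE, 0xCA]
for bytes in [jpeg, other] {
    // every byte ≥ 0x80 is invalid ASCII and is repaired to U+FFFD,
    // so both inputs decode to the same four-scalar string
    let scalars = String(decoding: bytes, as: Unicode.ASCII.self).unicodeScalars
    print(scalars.map { "U+" + String($0.value, radix: 16, uppercase: true) })
}
// ["U+FFFD", "U+FFFD", "U+FFFD", "U+FFFD"]
// ["U+FFFD", "U+FFFD", "U+FFFD", "U+FFFD"]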
The other option, Unicode.UTF8, has the same problem.
// both of these bytestrings are considered to be UTF-8 gibberish,
// and all gibberish strings compare equal.
String(decoding: [0xFF, 0xD8, 0xFF, 0xDB], as: Unicode.UTF8.self) ==
String(decoding: [0xEE, 0xC7, 0xEE, 0xCA], as: Unicode.UTF8.self)
// true
Indeed, the correct way to do these String comparisons is to widen our input bytestring to 16 bits, and import it as a UTF-16 Unicode string!
String(decoding: [0xFF as UInt8, 0xD8, 0xFF, 0xDB].map{ UInt16($0) },
       as: Unicode.UTF16.self)
    == "ÿØÿÛ"
// true (expected)
Aside from being inefficient for long strings, this idiom requires a deep understanding of Unicode to see why it is a valid identity†, and users are highly unlikely to discover it on their own. We should not require users to be Unicode experts in order to write correct bytestring code.‡
† It’s because, for each grapheme, the Unicode standard defines no more than one decomposition sequence which consists solely of codepoints below 0x100. (What, you haven’t read the Unicode Standard, Version 11.0.0?) Note that many graphemes still have multiple canonically-equivalent decomposition sequences containing at least one codepoint below 0x100. Because of this, widening an 8-bit machine string to a 16-bit machine string (so long as the code units are zero-extended) can introduce no new canonically-equivalent decomposition sequences, preserving the one-to-one relationship.
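One concrete example of what the footnote describes: "é" has two canonically-equivalent spellings, but only one of them consists solely of codepoints below 0x100.
let precomposed = "\u{E9}"        // U+00E9, entirely below 0x100
let decomposed  = "\u{65}\u{301}" // U+0065 U+0301, contains a codepoint above 0xFF
precomposed == decomposed         // true: canonically equivalent, yet only one
                                  // spelling fits in 8-bit code units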
‡ A correct String comparison identity for 16-bit machine strings is left as an exercise for the reader.
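Of course, none of this Unicode machinery is needed if we skip String entirely and compare the bytes themselves. A minimal sketch (hasJPEGMagic is a hypothetical helper, not an API being proposed):
func hasJPEGMagic(_ bytes: [UInt8]) -> Bool {
    // compare raw bytes; no decoding, no codec pitfalls
    return bytes.starts(with: [0xFF, 0xD8, 0xFF, 0xDB])
}
hasJPEGMagic([0xFF, 0xD8, 0xFF, 0xDB, 0x00]) // true
hasJPEGMagic([0xEE, 0xC7, 0xEE, 0xCA, 0x00]) // false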
False comparison positives aren’t the only correctness traps that can result from abuse of String APIs. Many textual formats, such as XML and JSON, are defined in terms of codepoints, and parsing by Character will lead to bugs. For example, a combining character after the opening quote of an XML attribute (as in attr="\u{308}value") is well-formed XML and must be parsed as a value starting with a combining character.
"\"\u{308}".count // 1
Credit to Michel Fortin for the example.
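The same example shows the difference between scanning by Character and scanning by Unicode scalar; a quick sketch (the attribute value here is purely illustrative):
let attribute = "\"\u{308}value\""

// a Character-level scan never sees a lone quotation mark: the combining
// codepoint fuses with it into a single grapheme cluster
attribute.first == ("\"" as Character)                      // false

// a scalar-level scan still sees the quote as its own codepoint, U+0022
attribute.unicodeScalars.first == ("\"" as Unicode.Scalar)  // true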
In cases like these, String is simply the wrong tool for the job, and if all you have is a String, then everything looks like a Character.
In addition, extracting characters at fixed offsets is an extremely common bytestring operation. (Quick! Get the month from a "YYYY-MM-DD" datestring!) Random-access integer subscripting is, by nature, extremely inefficient on String. Users working in String are liable to fall into performance traps which could easily add a factor of n to their run time.
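To make the gap concrete, here is a sketch of the same fixed-offset extraction done both ways (the date value is just an example):
let bytes: [UInt8] = Array("2018-07-11".utf8)

// on a bytestring, grabbing the month is two O(1) subscripts
let month = bytes[5 ... 6]                      // [0x30, 0x37], i.e. "07"

// on a String, the same operation requires O(n) index walking
let string = "2018-07-11"
let start  = string.index(string.startIndex, offsetBy: 5)
let end    = string.index(start, offsetBy: 2)
let monthSubstring = string[start ..< end]      // "07"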