@tayloraswift
Last active October 23, 2018 18:36
stop converting data to string dot mp3

Stop converting Data to String

Operating on [UInt8] text buffers (“bytestrings”) is a common programming task. A popular approach among some Swift users is to (ab)use the String API, and attempt to spell familiar C-idioms using its syntax. This has the major bonus of readability, but leaves users vulnerable to many pitfalls.

A common mistake is to convert bytestrings to Strings and compare them to other Strings. Given two bytestrings a:[UInt8], b:[UInt8], many users assume that

String(decoding: a, as: Unicode.ASCII.self) == 
String(decoding: b, as: Unicode.ASCII.self)

if and only if

a == b

but this doesn’t actually hold for all bytestrings. A real-world example of where this can cause harm is detecting the magic header for the JPEG image format, ['ÿ', 'Ø', 'ÿ', 'Û'] ([0xFF, 0xD8, 0xFF, 0xDB]). For the obvious choices of Unicode codec, an entirely different bytestring can match it.

// none of these codepoints are actually ASCII, so `Unicode.ASCII` 
// is clearly the wrong codec to use. 

String(decoding: [0xFF, 0xD8, 0xFF, 0xDB], as: Unicode.ASCII.self) == 
String(decoding: [0xEE, 0xC7, 0xEE, 0xCA], as: Unicode.ASCII.self)
// true

The other option, Unicode.UTF8, has the same problem.

// both of these bytestrings are invalid UTF-8, so every byte in them is 
// repaired to U+FFFD, and the two repaired strings compare equal.

String(decoding: [0xFF, 0xD8, 0xFF, 0xDB], as: Unicode.UTF8.self) == 
String(decoding: [0xEE, 0xC7, 0xEE, 0xCA], as: Unicode.UTF8.self)
// true
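One way to see why these comparisons come out true is to look at the decoded scalars. The sketch below assumes the standard library's behavior of repairing invalid input with U+FFFD REPLACEMENT CHARACTER; the constant names (`jpeg`, `other`, `replacement`) are illustrative only.

```swift
let jpeg: [UInt8]  = [0xFF, 0xD8, 0xFF, 0xDB]
let other: [UInt8] = [0xEE, 0xC7, 0xEE, 0xCA]

let replacement: Unicode.Scalar = "\u{FFFD}"

// Decoding repairs invalid input instead of failing, so the original
// byte values are lost before the comparison ever happens.
let a = String(decoding: jpeg,  as: Unicode.ASCII.self)
let b = String(decoding: other, as: Unicode.ASCII.self)

print(a.unicodeScalars.allSatisfy { $0 == replacement }) // true
print(b.unicodeScalars.allSatisfy { $0 == replacement }) // true
print(a == b)                                            // true
print(jpeg == other)                                     // false
```

Once both inputs have collapsed into runs of replacement characters, the String comparison has nothing left to tell them apart.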

In fact, the correct way to do these String comparisons is to widen our input bytestring to 16 bits, and import it as a UTF-16 Unicode string!

String(decoding: [0xFF as UInt8, 0xD8, 0xFF, 0xDB].map{ UInt16($0) }, 
             as: Unicode.UTF16.self) 
             == "ÿØÿÛ"
// true (expected)

Aside from being inefficient for long strings, this idiom is far from obvious: understanding why it is a valid identity† requires a deep understanding of Unicode, and users are highly unlikely to discover it on their own. We should not require users to be Unicode experts in order to write correct bytestring code.
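For contrast, the same magic-header check can be written directly on the bytes, with no codec involved. This is a minimal sketch; `hasJPEGMagic` is a hypothetical helper introduced here for illustration, not a standard library API.

```swift
// Hypothetical helper: checks whether a byte buffer begins with the
// JPEG magic header [0xFF, 0xD8, 0xFF, 0xDB].
func hasJPEGMagic(_ bytes: [UInt8]) -> Bool {
    let magic: [UInt8] = [0xFF, 0xD8, 0xFF, 0xDB]
    // `starts(with:)` compares element by element; no Unicode decoding happens.
    return bytes.starts(with: magic)
}

let real: [UInt8]     = [0xFF, 0xD8, 0xFF, 0xDB, 0x00, 0x43]
let impostor: [UInt8] = [0xEE, 0xC7, 0xEE, 0xCA, 0x00, 0x43]

print(hasJPEGMagic(real))     // true
print(hasJPEGMagic(impostor)) // false
```

Because the comparison is on raw `UInt8` values, two buffers match only when their bytes are identical.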

† It’s because for each grapheme, the Unicode standard defines no more than one decomposition sequence which consists solely of codepoints below 0x100. (What, you haven’t read the Unicode Standard, Version 11.0.0?) Note that many graphemes still have multiple canonically-equivalent decomposition sequences containing at least one codepoint below 0x100. Because of this, widening an 8-bit machine string to a 16-bit machine string cannot introduce canonically equivalent decomposition sequences (so long as the code units are zero-extended), which preserves the one-to-one relationship.

‡ A correct String comparison identity for 16-bit machine strings is left as an exercise for the reader.

False positives in comparisons aren’t the only correctness traps that can result from abuse of String APIs. Many textual formats, such as XML and JSON, are defined in terms of codepoints, and parsing them by Character will lead to bugs. For example, a combining character after the opening quote of an XML attribute (as in attr="\u{308}value") is well-formed XML and must be parsed as a value starting with a combining character.

"\"\u{308}".count // 1

Credit to Michel Fortin for the example.

In cases like these, String is simply the wrong tool for the job, and if all you have is a String, then everything looks like a Character.
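The difference between the two views can be sketched with the attribute example above. The names `source`, `quote`, `scalars` and `value` are illustrative; the sketch only shows that the opening quote disappears when iterating by Character and survives when iterating by Unicode scalar, and is not a real XML parser.

```swift
// The quoted attribute value from the example: "\u{308}value"
let source = "\"\u{308}value\""
let quote: Unicode.Scalar = "\""

// By Character, the opening quote and U+0308 fuse into one grapheme
// cluster, so a search for the Character "\"" misses the opening quote.
print(source.first == "\"")    // false
print(source.count)            // 7

// By Unicode scalar, the quote is its own element.
let scalars = Array(source.unicodeScalars)
print(scalars.first == quote)  // true
print(scalars.count)           // 8

// Strip the surrounding quotes; the value starts with the combining mark.
let value = Array(scalars.dropFirst().dropLast())
print(value.first?.value == 0x308) // true
```

Working on `unicodeScalars` (or on the raw UTF-8 bytes) keeps the parser aligned with how the format itself is specified.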

In addition, extracting characters at fixed offsets is an extremely common bytestring operation. (Quick! Get the month from a "YYYY-MM-DD" datestring!) Random-access integer subscripting is extremely inefficient, by nature, on String. Users working in String are liable to fall into performance traps which could easily add a factor of n to their run time.
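On a [UInt8] buffer the same task is a pair of constant-time subscripts. The following is a minimal sketch, assuming an ASCII "YYYY-MM-DD" input of exactly ten bytes; `month(fromISODate:)` is a hypothetical helper, not an existing API.

```swift
// Hypothetical helper: reads the two-digit month from an ASCII
// "YYYY-MM-DD" buffer using constant-time integer subscripts.
func month(fromISODate bytes: [UInt8]) -> Int? {
    guard bytes.count == 10 else { return nil }
    let zero = UInt8(ascii: "0")
    let nine = UInt8(ascii: "9")
    let tens = bytes[5]
    let ones = bytes[6]
    // Reject anything that is not an ASCII digit before subtracting.
    guard (zero ... nine).contains(tens), (zero ... nine).contains(ones) else {
        return nil
    }
    let value = Int(tens - zero) * 10 + Int(ones - zero)
    return (1 ... 12).contains(value) ? value : nil
}

let date = Array("2018-10-23".utf8)
if let m = month(fromISODate: date) {
    print(m) // 10
}
```

The String equivalent has to walk from `startIndex` with `index(_:offsetBy:)` for each access, which is where the extra factor of n comes from when such lookups sit inside a loop.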
