The Twitter API doesn't sanitize control characters such as U+0010 (Data Link Escape) in its JSON output, whereas its XML output replaces them with "*". See the attached files and compare the text field. This becomes a problem when a third-party API site such as Gtweet generates an XML-based format (RSS, in this case) from the JSON: unless the third-party site sanitizes the text itself, the result is malformed XML.
In fact, this already broke my feed reader.
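To illustrate why the feed breaks, here is a minimal Python sketch. It does not call the Twitter API; the JSON payload, the `description` element, and the `is_well_formed` helper are made up for illustration. It shows that U+0010 is acceptable inside a JSON string (as the escape `\u0010`) but makes an XML document not well-formed once it is copied in verbatim.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payload: JSON may carry U+0010 as the escape \u0010.
payload = json.loads('{"text": "before\\u0010after"}')
text = payload["text"]

# Naively embedding the decoded text in an RSS-like element.
naive_xml = "<description>" + text + "</description>"


def is_well_formed(xml_string):
    """Return True if the XML parser accepts the string, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(naive_xml))
    print(is_well_formed("<description>before*after</description>"))
```

A strict feed reader behaves like the parser here and rejects the whole document rather than just the offending character.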
The attached files were generated with
curl http://api.twitter.com/1/statuses/show/89719449015427072.json
and
curl http://api.twitter.com/1/statuses/show/89719449015427072.xml
Any one of three alternatives would solve my problem:
- As the XML output already sanitizes these control characters, doing the same for the JSON output probably wouldn't cost much in performance.
- Gtweet could do the sanitization when generating RSS. This might be the most reasonable option, but I am not sure.
- Twitter could do the sanitization at user input time, as control characters are considered invalid in raw HTML and shouldn't appear in the DOM when you open, say, the example tweet.
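The second option (sanitizing on the consumer side before emitting RSS) could look roughly like the Python sketch below. This is only an assumption about how a site like Gtweet might do it, not its actual code; `sanitize_for_xml` and `build_description` are hypothetical names. The character class follows the `Char` production of XML 1.0, and the "*" replacement mirrors what the Twitter XML output does.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 "Char" production: tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are allowed.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_for_xml(text, replacement="*"):
    """Replace characters that XML 1.0 forbids with a placeholder."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def build_description(text):
    """Hypothetical helper: wrap tweet text in an RSS <description>."""
    # Sanitize first, then escape markup characters such as & and <.
    return "<description>" + escape(sanitize_for_xml(text)) + "</description>"


if __name__ == "__main__":
    xml_string = build_description("before\x10after & more")
    ET.fromstring(xml_string)  # raises ParseError if not well-formed
    print(xml_string)
```

Sanitizing before escaping keeps the two concerns separate: the first step removes characters that no escaping can make legal in XML 1.0, and the second handles ordinary markup characters.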