@rendello
rendello / _utf8_case_data.rs
Last active November 6, 2024 18:17
Unicode codepoints that expand or contract when case is changed in UTF-8. Good for testing parsers. Includes the data `utf8_case_data.rs` and the script to generate it, `generate_utf8.py`.
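To make "expand or contract" concrete, here is a minimal sketch that compares the UTF-8 byte length of a character before and after a case change. It relies only on Python's built-in `str.lower()`, `str.upper()`, and `str.encode()`; the helper names `utf8_len` and `case_length_change` are hypothetical and are not taken from the gist's `generate_utf8.py`.

```python
def utf8_len(s: str) -> int:
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def case_length_change(ch: str) -> dict:
    """UTF-8 byte lengths of ch, its lowercase form, and its uppercase form."""
    return {
        "original": utf8_len(ch),
        "lower": utf8_len(ch.lower()),
        "upper": utf8_len(ch.upper()),
    }


# A few sample characters (chosen for illustration, not copied from the data file):
# U+0130 expands when lowercased, U+212A and U+0131 contract, "A" is unchanged.
for ch in ("\u0130", "\u212A", "\u0131", "A"):
    print(f"U+{ord(ch):04X}", case_length_change(ch))
```

A parser that assumes a case change preserves byte length (for example, by lowercasing in place in a fixed-size buffer) can be exercised with characters like these.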
/*
Copyright (c) 2024 Rendello
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
rendello / generate_utf8.py
Last active November 1, 2024 18:03
Script that generates the "UTF-8 characters that behave oddly when the case is changed" data.
# Moved Gist, now combined with the data at:
# https://gist.github.com/rendello/d37552507a389656e248f3255a618127
rendello / _unicode_roundtrip_unsafe.txt
Last active November 6, 2024 18:17
Unicode roundtrip-unsafe characters. They change to different characters (or sets of characters) when case is changed and then changed back.
Uppercase -> lowercase -> uppercase:
İ i̇ İ LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I, COMBINING DOT ABOVE -> LATIN CAPITAL LETTER I, COMBINING DOT ABOVE
Ω ω Ω OHM SIGN -> GREEK SMALL LETTER OMEGA -> GREEK CAPITAL LETTER OMEGA
ẞ ß SS LATIN CAPITAL LETTER SHARP S -> LATIN SMALL LETTER SHARP S -> LATIN CAPITAL LETTER S, LATIN CAPITAL LETTER S
K k K KELVIN SIGN -> LATIN SMALL LETTER K -> LATIN CAPITAL LETTER K
Å å Å ANGSTROM SIGN -> LATIN SMALL LETTER A WITH RING ABOVE -> LATIN CAPITAL LETTER A WITH RING ABOVE
ϴ θ Θ GREEK CAPITAL THETA SYMBOL -> GREEK SMALL LETTER THETA -> GREEK CAPITAL LETTER THETA
Lowercase -> uppercase -> lowercase:
ῗ Ϊ͂ ῗ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI -> GREEK CAPITAL LETTER IOTA, COMBINING DIAERESIS, COMBINING GREEK PERISPOMENI -> GREEK SMALL LETTER IOTA, COMBINING DIAERESIS, COMBINING GREEK PERISPOMENI
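The round trips listed above can be reproduced with a short sketch like the following. It assumes Python's built-in full case mappings (`str.lower()` and `str.upper()`) and the standard `unicodedata` module; the function names `upper_roundtrip`, `lower_roundtrip`, and `names` are hypothetical helpers for illustration, not the gist author's code, and results may differ slightly in other languages or Unicode versions.

```python
import unicodedata


def upper_roundtrip(ch: str) -> tuple:
    """Uppercase -> lowercase -> uppercase: return (lowered, uppercased again)."""
    lowered = ch.lower()
    return lowered, lowered.upper()


def lower_roundtrip(ch: str) -> tuple:
    """Lowercase -> uppercase -> lowercase: return (uppercased, lowercased again)."""
    raised = ch.upper()
    return raised, raised.lower()


def names(s: str) -> str:
    """Comma-separated Unicode names of every code point in s."""
    return ", ".join(unicodedata.name(c) for c in s)


# Kelvin sign: the round trip ends on a different code point than it started on.
start = "\u212A"
lowered, back = upper_roundtrip(start)
print(names(start), "->", names(lowered), "->", names(back))
print("roundtrip-unsafe:", back != start)
```

A character is roundtrip-unsafe in a given direction when the final string differs from the starting one, which is the comparison the last line performs.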