Last active
March 23, 2017 11:39
Gist: robbypelssers/5186812
Unicode Normalization
import java.text.Normalizer

/**
 * Problem: characters with accents or other adornments can be encoded in several different ways in Unicode.
 * From a user's point of view, notations that mean the same thing should be indistinguishable in text search,
 * so it is important to store text in a normalized Unicode form. The code below shows
 * how to check whether text is normalized and how to normalize it.
 */
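object NormalizationTest {
  def main(args: Array[String]): Unit = {
    // The ohm symbol is written as an escape for U+2126 (OHM SIGN) so the example does not depend on
    // how this file is encoded. NFC replaces it with U+03A9 (GREEK CAPITAL LETTER OMEGA), which looks the same.
    val text = "16-bit transceiver with direction pin, 30 \u2126 series termination resistors;"
    println(text)
    // false: U+2126 is not allowed in NFC text
    println(Normalizer.isNormalized(text, Normalizer.Form.NFC))
    val normalizedText = Normalizer.normalize(text, Normalizer.Form.NFC)
    println(normalizedText)
    // true: the normalized copy contains U+03A9 instead
    println(Normalizer.isNormalized(normalizedText, Normalizer.Form.NFC))
  }
}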
/**
 * Output printed to console:
 * -------------------------------
 *
 * 16-bit transceiver with direction pin, 30 Ω series termination resistors;
 * false
 * 16-bit transceiver with direction pin, 30 Ω series termination resistors;
 * true
 */
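
The header comment's point about text search can be illustrated with a small sketch: two strings that render identically but use different code points compare as unequal until both are normalized. The object name `NormalizedSearch` and the helpers `nfc` and `sameText` are hypothetical names introduced here for illustration; only `java.text.Normalizer` comes from the original file.

```scala
import java.text.Normalizer

object NormalizedSearch {
  // Hypothetical helper: bring a string into NFC before storing or comparing it.
  def nfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  // Hypothetical helper: compare two strings after normalizing both to NFC.
  def sameText(a: String, b: String): Boolean = nfc(a) == nfc(b)

  def main(args: Array[String]): Unit = {
    val composed   = "caf\u00E9"   // e with acute accent as one code point (U+00E9)
    val decomposed = "cafe\u0301"  // plain e followed by combining acute accent (U+0301)

    println(composed == decomposed)         // prints false: the raw code points differ
    println(sameText(composed, decomposed)) // prints true: both normalize to the same NFC string
  }
}
```

Normalizing once at write time, as the header comment suggests, would mean a search only has to normalize the query; `sameText` normalizes both sides simply so the sketch works on arbitrary input.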