Most text nowadays is encoded as Unicode, and yet some programming languages either don't natively support it, or require you to jump through some hoops to support it. Elixir and Haskell have good support for Unicode, but there are some issues we have to understand in order to manipulate text well.
Computer systems often default to a common character set called ASCII. Before we dig into Unicode, let's understand what a character set is.
A character set is just a set of characters. ASCII has 128 characters, while Unicode has room for 1,114,112 codepoints. Not every character is printable; some are called control characters. These characters can format or structure the text; examples include newlines, tabs and nulls.
ASCII characters are mostly English characters. So in order to support foreign languages or mathematical symbols, we need to expand our character count. Unicode tries to contain every possible character available on Earth, and then some.
A character set is enumerated by natural numbers (starting from 0) called codepoints. These codepoints serve as labels for every character in a character set. That way, we can refer to a particular character just by using its codepoint number, and not have to actually draw it.
A text encoding is an isomorphic function that maps the characters in the character set to particular binary/bitstring/bytestring sequences. Everything in modern computers is stored as binary, so every character in a character set is also stored as some sort of binary sequence. An important realisation is that a character in a character set has nothing to do with the font you are using. Fonts provide a specific graphical rendering for the canonical character data in the character set.
It is isomorphic because given a particular encoding, it is possible to map the binary sequence to the character labeled by a codepoint, and vice versa.
Let's see how ASCII works first.
The ASCII text encoding function maps every character in ASCII to a single byte. Remember that a byte is an 8 bit sequence. This means something like:
\0 = 0 = [0,0,0,0,0,0,0,0]
...etc
a = 97 = [0,1,1,0,0,0,0,1]
b = 98 = [0,1,1,0,0,0,1,0]
...etc
\d = 127 = [0,1,1,1,1,1,1,1]
As we said before, ASCII has 128 characters. Its codepoints start at 0 and end at 127. Its text encoding function follows decimal to binary conversion. For example, converting the decimal (base 10) 97 to binary (base 2) gives us [1,1,0,0,0,0,1]; however, to convert it to a byte, we need to pad it until it reaches a multiple of 8, which gives us [0,1,1,0,0,0,0,1]. Although this is true for ASCII, codepoint numbers in other text encodings need not follow this convention.
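For example, a rough way to check this in GHCi (a small sketch; showIntAtBase comes from the standard Numeric module and intToDigit from Data.Char, while padByte is just a throwaway helper defined here):
> import Numeric (showIntAtBase)
> import Data.Char (intToDigit)
> showIntAtBase 2 intToDigit 97 "" -- ^ "1100001"
> let padByte bits = replicate (8 - length bits) '0' ++ bits
> padByte (showIntAtBase 2 intToDigit 97 "") -- ^ "01100001"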
You may notice that the most significant bit is always 0. This is why ASCII is sometimes called a 7 bit text encoding. When ASCII was first introduced, 8 bits to a byte was not yet a standard. The 8th bit was sometimes used for extensions to ASCII.
The fact that every character in ASCII maps to a single byte (8 bits) means that the memory size of an ASCII string is equal to the character length of an ASCII string. You can therefore acquire the length or size of the string in constant time.
The reason this works is that from 0 to 255 (the first 256 decimals), a decimal to binary mapping keeps every codepoint inside 1 byte. At 256 and beyond, a decimal to binary mapping can no longer be kept inside 1 byte. Of course ASCII only needs codepoints 0 to 127.
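For ASCII data, the byte count and the character count coincide. A minimal sketch of this in GHCi, assuming the common bytestring package is installed (Data.ByteString stores its length, so BS.length is constant time):
> import qualified Data.ByteString.Char8 as BS
> BS.length (BS.pack "hello") -- ^ 5 bytes
> length "hello"              -- ^ 5 characters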
In Haskell and Elixir, ASCII is fully supported. Most characters are writable, and for the non-printable characters, you can use their escape code. See: http://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart
UTF-8 is an ASCII compatible text encoding that covers all 1,114,112 Unicode codepoints. This allows us to support virtually any language on Earth. The ASCII compatibility means that the first 128 codepoint to binary mappings are exactly the same as ASCII.
It is considered a multibyte variable width encoding, meaning that a particular character codepoint can be mapped to 1, 2, 3, or 4 bytes. This means the memory size of the string will be at most 4× the number of characters that exist inside the string. So if the string size was 4000 bytes, then you can be sure the string character length is somewhere between 1000 and 4000 characters. You cannot figure out the character length in constant time; you will need to iterate through the string to find out the real character length.
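A minimal sketch of the difference in GHCi, assuming the text and bytestring packages are installed:
> import qualified Data.Text as T
> import qualified Data.Text.Encoding as TE
> import qualified Data.ByteString as B
> T.length (T.pack "яяя")                 -- ^ 3 characters
> B.length (TE.encodeUtf8 (T.pack "яяя")) -- ^ 6 bytes (я takes 2 bytes in UTF-8)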
Codepoints in UTF-8 do not necessarily follow decimal to binary conversion. The encoding certainly does follow it for the first 128 ASCII compatible codepoints from 0 to 127. All of these fit into 1 byte.
Beyond 127, codepoints can map to binary sequences that aren't related to decimal (base 10) to binary (base 2) conversion. For example, codepoint 254 using decimal to binary would map to [1,1,1,1,1,1,1,0], which fits in 1 byte. However codepoint 254 in UTF-8 is in fact [1,1,0,0,0,0,1,1,1,0,1,1,1,1,1,0], which is 2 bytes.
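You can verify this yourself; a quick sketch, again assuming the text and bytestring packages:
> import qualified Data.Text as T
> import qualified Data.Text.Encoding as TE
> import qualified Data.ByteString as B
> B.unpack (TE.encodeUtf8 (T.pack "\254")) -- ^ [195,190], i.e. [1,1,0,0,0,0,1,1] and [1,0,1,1,1,1,1,0]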
This website allows you to quickly convert UTF-8 characters to their isomorphic representations: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi
Both Haskell and Elixir have excellent support for UTF-8. Let's investigate.
-- | Straight Codepoint Syntax. Use putStrLn, as print is for debugging, not outputting strings.
> putStrLn "\87" -- ^ W
> putStrLn "\1103" -- ^ я
> print "\1103" -- ^ "\1103"
-- | Manipulating codepoint numbers
-- If you print a '\1103' or "\1103", it's not literally \1103. It's an isomorphic debugging
-- representation of the real binary sequence.
-- `toEnum` is return type polymorphic: it takes an Int to any type in the Enum typeclass. You can then
-- type hint it or type constrain it to one of the possible instances.
> toEnum 1103 :: Char -- ^ '\1103'
> fromEnum 'я' :: Int -- ^ 1103
> (succ $ toEnum 35) :: Char -- ^ '$' (you can enumerate codepoints!)
>
> let x = toEnum 1103
> x :: Char    -- ^ '\1103' (gives the character я)
> x :: Integer -- ^ 1103 (gives back an Integer)
> x :: Int     -- ^ 1103 (gives back an Int)
>
> putStrLn (x:[]) -- ^ я
> putStrLn [x]    -- ^ я
-- | Using the Data.Char library
> import Data.Char
>
> chr 1103 -- ^ '\1103' ~ 'я'
> ord 'я' -- ^ 1103
-- | Haskell supports codepoint, escape code, hexadecimal, octal, escaped abbreviation,
-- and direct source.
-- These notations can be used inside of string constructors `""` and character constructors `''`
-- and raw values.
> let a = "\70"
> let b = "\b"
> let c = "\xFF"
> let d = "\o777"
> let e = "\NUL"
> let f = "ł"
> putStrLn (a ++ b ++ c ++ d ++ e ++ f)
-- | Raw value notation, currently only hex or octal, no binary notation (but see -XBinaryLiterals below)
-- These are stored as integers. That is `511 == 0o777`.
> let g = 0xFF
> let h = 0o777
> print g >> print h
-- | Using Data.Text
-- We can convert raw binary strings into the `Text` type, which is
-- a space efficient, packed, unboxed Unicode text type.
-- 'Text' is basically binary data equipped with a character encoding.
-- This library gives more specific functions to work with Unicode text.
-- See this for working with text or bytestrings: http://blog.ezyang.com/2010/08/strings-in-haskell/
> import Data.Text
> (unpack . pack $ "я") == "я"
Other Resources:
- http://book.realworldhaskell.org/read/characters-strings-and-escaping-rules.html
- https://honza.ca/2012/10/haskell-strings
Haskell has an extension called OverloadedStrings, which allows literal strings to be written in a polymorphic manner. Usually any string literal is exactly of the String type. But with OverloadedStrings, you can do things like:
> :set -XOverloadedStrings
> "hello"
"hello"
it :: Data.String.IsString a => a
This makes literal strings polymorphic over the IsString type class. By default it only has 1 instance, which is the String type. However if you import string handling libraries, such as Data.Text or Data.ByteString, you can type hint a literal string to be of a Text type or a ByteString type.
This makes the literal string constructor "" much more flexible and extensible by third party libraries. The type of the string can be either type hinted or inferred from subsequent operations. It also means you avoid having to write conversions from the native String type whenever you want to do complex string manipulation.
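For example, a minimal sketch of hinting literals to Text and ByteString (assuming the text and bytestring packages are installed):
> :set -XOverloadedStrings
> import qualified Data.Text as T
> import qualified Data.ByteString.Char8 as BS
> "hello" :: T.Text        -- ^ "hello"
> "hello" :: BS.ByteString -- ^ "hello"
> T.toUpper "hello"        -- ^ "HELLO" (the literal's type is inferred from T.toUpper)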
Elixir represents strings as binary sequences. It also has character lists: the '' syntax is essentially the same as the Haskell String type, as it is in fact a [Char]. There's no individual Char type in Elixir though, so it's always a character list. Normal strings exist in binary form, so you have to use functions that work on binary sequences to manipulate them. This gives rise to a pretty cool binary comprehension syntax.
## Character to Codepoint
> ?я # 1103 (the ? prefix yields the codepoint of the character that follows; you can use it in expressions as well)
> to_char_list "я" # [1103]
> 'я' # [1103]
## Codepoint to Character
> <<1103::utf8>> # "я" (it's also possible to use a variable like `<<var::utf8>>`)
> to_string [1103] # "я"
> "\x{44F}" # "я" (more like hex to character)
> IO.chardata_to_string [1103] # "я"
> IO.chardata_to_string [0x44F] # "я"
> IO.chardata_to_string [0b10001001111] # "я"
> Integer.to_char_list(1103, 16) # '44F' (converts the decimal 1103 from base 10 to base 16 hex)
# This is an incredibly useful function, as it allows you to convert a
# base 10 integer to any base. Remember that it returns a
# character list, because digits in other bases aren't necessarily decimal digits.
## Raw value notation
## These are stored as integers.
## One unique thing is that Elixir has binary syntax!
> a = 0b1010
> b = 0xFF
> c = 0o777
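## A quick check: they are all stored as plain integers, so they compare equal to their decimal forms:
> a == 10  # true
> b == 255 # true
> c == 511 # true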
## Elixir supports direct UTF-8 source, escape code, the null escape, hex code,
## strange codepoint, and hex code expression.
## The "\126" codepoint doesn't seem to match the UTF-8 codepoint mapping. I'm not sure what
## Elixir is using for that codepoint mapping.
> a = "ł"
> b = "\b"
> c = "\0" # Elixir has no Haskell-style "\NUL" abbreviation; \0 gives the null byte
> d = "\xFF"
> f = "\126"
> g = "\x{332}"
Binary comprehension syntax is quite special in Elixir and deserves its own section.
> <<73 :: utf8>>    # "I"
> <<10029 :: utf8>> # a multi-byte UTF-8 character (codepoint 10029)
The value on the right of :: describes the size/encoding of the value on the left. So <<73 :: utf8>> means that 73 is to be encoded as utf8, which is variable-width. But you can also use fixed sizes such as <<126 :: 8>>, which converts 126 into an 8 bit bitstring, giving us a byte representing ~ (which is also its ASCII encoding). The 8 bit size is the default, so you can just use <<126>>. However there is a limit: the largest decimal integer that fits in a single byte is 255. If you try <<256>>, it doesn't fit into a single byte, so it gets truncated to its least significant 8 bits, giving <<0>>.
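For example, checking these in iex:
> <<126 :: 8>>  # "~"
> <<126>>       # "~" (8 bits is the default size)
> <<256>>       # <<0>> (only the least significant 8 bits are kept)
> <<256 :: 16>> # <<1, 0>> (a bigger size keeps all the bits)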
We can easily take any binary representation in Elixir, and comprehend its constituent bits.
> for << b::1 <- <<1103::utf8>> >>, do: b
## [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
It takes the 1103 codepoint as a UTF-8 character, and then iterates through the binary sequence 1 bit at a time and returns it as a list of bits, 1s and 0s.
It's important to realise that it actually returns a list of base 10 decimal integers. This is because it takes each bit sequence and converts it to base 10. For the above this works like 0 => 0 and 1 => 1.
We can also change the iterated bit size to 4. This just means it iterates through the binary sequence 4 bits at a time, so 4 bits are considered during each conversion to base 10 decimal. For <<256::16>> below, this works like 0,0,0,0 => 0; 0,0,0,1 => 1; 0,0,0,0 => 0; 0,0,0,0 => 0.
> for << b::4 <- <<256::16>> >>, do: b
> # [0, 1, 0, 0]
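## Iterating 8 bits at a time gives back the raw bytes of the UTF-8 encoded я:
> for << b::8 <- <<1103::utf8>> >>, do: b
## [209, 143]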
As you can see, this is a pretty expressive way of directly manipulating binary sequences. This means you can easily work with binary protocols or other kinds of character encodings. For more information on these kinds of expressions, see: http://elixir-lang.org/docs/v1.0/elixir/Kernel.SpecialForms.html#%3C%3C%3E%3E/1
Lastly, for more information on working with Unicode in a software context, see: http://unicodebook.readthedocs.org/en/latest/index.html
Haskell has a language extension -XBinaryLiterals that allows one to write binary literals like 0b1111 or 0B1111. These are desugared into `fromInteger` applied to the corresponding integer (they aren't stored as actual binary notation, of course!).
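A minimal sketch in GHCi (the extension shipped with GHC 7.10):
> :set -XBinaryLiterals
> 0b1111       -- ^ 15
> 0b1111 == 15 -- ^ True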
Also, for fast binary ASCII case conversions (see the sketch after this list):
- lower - OR 32
- upper - AND 223
- flip - XOR 32
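A small sketch of these tricks in GHCi, using the standard Data.Bits and Data.Char modules (they are only valid for the ASCII letters):
> import Data.Bits ((.|.), (.&.), xor)
> import Data.Char (chr, ord)
> chr (ord 'A' .|. 32)    -- ^ 'a' (lower: OR 32)
> chr (ord 'a' .&. 223)   -- ^ 'A' (upper: AND 223)
> chr (ord 'A' `xor` 32)  -- ^ 'a' (flip: XOR 32)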