Unicode Explorer

By @MonoidMusician

Enter some text. Select some text. Have some fun.

Note: grapheme analysis is out of scope.

Encodings

Ordering

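UTF-8 and UTF-32 have consistent ordering: comparing encoded strings code unit by code unit gives the same result as comparing code points. (Citations needed.) But UTF-16 is kind of fucked up.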

The reason that UTF-16 is fucked up is that surrogates are taken from U+D800 through U+DFFF (and are used to encode U+010000 through U+10FFFF – the astral characters), but U+E000 through U+FFFF are still valid code points. Is there a standard name for these? I will call them “High BMP”.
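To make the surrogate mechanism concrete, here is a minimal sketch of how one code point turns into UTF-16 code units. The helper name `utf16Units` is made up for this illustration and is not part of any library. It leans on the `BoundedEnum CodePoint` instance from `Data.String.CodePoints` to get at the numeric value.

```purescript
module Main where

import Prelude

import Data.Enum (fromEnum, toEnum)
import Data.String.CodePoints (CodePoint)
import Effect (Effect)
import Effect.Console (logShow)

-- | UTF-16 code units for a single code point (hypothetical helper).
-- | Anything below U+10000 is one code unit; astral code points become a
-- | high surrogate followed by a low surrogate. A surrogate code point
-- | passed in directly comes back as a single, unpaired code unit.
utf16Units :: CodePoint -> Array Int
utf16Units cp
  | n < 0x10000 = [ n ]
  | otherwise =
      [ 0xD800 + (offset / 0x400) -- high surrogate carries the top 10 bits
      , 0xDC00 + (offset `mod` 0x400) -- low surrogate carries the bottom 10 bits
      ]
  where
  n = fromEnum cp
  offset = n - 0x10000

main :: Effect Unit
main = logShow (map utf16Units (toEnum 0x1F600))
-- U+1F600 becomes the pair 0xD83D 0xDE00, i.e. (Just [55357,56832])
```

Every astral character therefore starts with a code unit between 0xD800 and 0xDBFF, which is what matters for the ordering question.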

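In UTF-16, this High BMP region will compare as greater than the astral characters, because every astral character starts with a surrogate code unit (0xD800 through 0xDBFF), which is numerically smaller than any High BMP code unit (0xE000 through 0xFFFF).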
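A quick way to see the mismatch, sketched under the assumption that the code runs on the JavaScript backend, where `String` is UTF-16 and its `Ord` instance compares code units. The function name `compareBothWays` is invented for this example.

```purescript
module Main where

import Prelude

import Data.Enum (toEnum)
import Data.Maybe (Maybe)
import Data.String.CodePoints (singleton)
import Effect (Effect)
import Effect.Console (logShow)

-- | Compare two code points by value and by the ordering of their
-- | one-character strings. Nothing means an input was out of range.
-- | Assumes the JS backend, where String comparison is by UTF-16 code unit.
compareBothWays :: Int -> Int -> Maybe { byValue :: Ordering, byUTF16 :: Ordering }
compareBothWays a b = do
  cpA <- toEnum a
  cpB <- toEnum b
  pure
    { byValue: compare cpA cpB
    , byUTF16: compare (singleton cpA) (singleton cpB)
    }

main :: Effect Unit
main = logShow (compareBothWays 0xE000 0x10000)
-- byValue is LT (U+E000 is the smaller code point),
-- byUTF16 is GT (0xE000 is larger than the leading surrogate 0xD800)
```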

Things get worse if you allow unpaired surrogates …

Example code

import Prelude
import Control.Alternative (guard)
import Data.Array as Array
import Data.Enum (toEnum)
import Data.Maybe (Maybe(..))
import Data.String.CodePoints (CodePoint)
import Partial.Unsafe (unsafeCrashWith)

-- Regions in UTF-16 code unit order: LowBMP < Astral < HighBMP
-- (the derived Ord instance follows constructor order)
data Region = LowBMP | Astral | HighBMP
derive instance Eq Region
derive instance Ord Region

-- Will crash on surrogates (U+D800 to U+DFFF)
compareUTF16 :: Array CodePoint -> Array CodePoint -> Ordering
compareUTF16 l r = Array.fold
  -- character-by-character comparison
  [ Array.fold (Array.zipWith cmp16 l r)
  -- and if they compared to be equal, then look at lengths
  , compare (Array.length l) (Array.length r)
  ]
  where
  cmp16 :: CodePoint -> CodePoint -> Ordering
  cmp16 cp1 cp2 =
    -- first look at the region
    (compare (regionOf cp1) (regionOf cp2)) <>
    -- then at the specific value
    (compare cp1 cp2)

regionOf :: CodePoint -> Region
regionOf cp = exactlyOneOf $ Array.catMaybes
  [ LowBMP <$ guard (isLowBMP cp)
  , Astral <$ guard (isAstral cp)
  , HighBMP <$ guard (isHighBMP cp)
  ]

isLowBMP :: CodePoint -> Boolean
isLowBMP = region 0x0000 0xD7FF

isHighBMP :: CodePoint -> Boolean
isHighBMP = region 0xE000 0xFFFF

isAstral :: CodePoint -> Boolean
isAstral = region 0x010000 0x10FFFF

region :: Int -> Int -> CodePoint -> Boolean
region cpLow cpHigh cp = cpLit cpLow <= cp && cp <= cpLit cpHigh

cpLit :: Int -> CodePoint
cpLit i = case toEnum i of
  Nothing -> unsafeCrashWith ("cpLit: not a valid code point: " <> show i)
  Just cp -> cp

exactlyOneOf :: forall a. Array a -> a
exactlyOneOf [a] = a
exactlyOneOf [] = unsafeCrashWith "No options in exactlyOneOf"
exactlyOneOf _ = unsafeCrashWith "Too many options in exactlyOneOf"
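One way to sanity-check the region idea is to compare it against what the runtime does with strings. The sketch below does not reuse the listing above. It swaps in a more compact technique, a numeric sort key (`utf16Key`, a name invented here) that lifts High BMP above the astral range, and checks that the key agrees with native `String` comparison on a handful of samples. It assumes the JavaScript backend, where `String` comparison is by UTF-16 code unit, and it deliberately leaves surrogates out.

```purescript
module Main where

import Prelude

import Data.Array as Array
import Data.Enum (fromEnum, toEnum)
import Data.Foldable (all)
import Data.String.CodePoints (CodePoint, singleton)
import Effect (Effect)
import Effect.Console (logShow)

-- | Hypothetical sort key: moves High BMP (U+E000 to U+FFFF) above the
-- | astral range and leaves every other code point where it is.
utf16Key :: CodePoint -> Int
utf16Key cp
  | 0xE000 <= n && n <= 0xFFFF = n + 0x110000
  | otherwise = n
  where
  n = fromEnum cp

-- | Boundary samples from each region (no surrogates).
samples :: Array CodePoint
samples = Array.mapMaybe toEnum
  [ 0x0041, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF ]

-- | True when the key order matches String comparison for every pair.
agreesWithStrings :: Boolean
agreesWithStrings = all identity do
  a <- samples
  b <- samples
  pure (compare (utf16Key a) (utf16Key b) == compare (singleton a) (singleton b))

main :: Effect Unit
main = logShow agreesWithStrings
```

This only checks single code points. For whole strings, the element-wise comparison followed by the length comparison is still needed, as in `compareUTF16`.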

Endianness

I am hoping I do not have to cover endianness here …

Unpaired Surrogates

Layout of Unicode

https://stackoverflow.com/questions/52203351/why-is-unicode-restricted-to-0x10ffff

https://en.wikipedia.org/wiki/Specials_(Unicode_block)

https://en.wikipedia.org/wiki/Private_Use_Areas

Never Assigned

Categories

https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

Combining characters are assigned the Unicode major category “M” (“Mark”). https://en.wikipedia.org/wiki/Combining_character

  • Mn = Mark, nonspacing
  • Mc = Mark, spacing combining
  • Me = Mark, enclosing

Terminology

Misc