Wednesday 8 July 2015

Scala: Unicode, comparing combined characters, normalization

In unicode, you can get some characters by single sequence or by combining it with another character. For example:

In scala:

val c1 = "Idee\u0301"
val c2 = "Ide\u00e9"

c1: String = Ideé
c2: String = IdeƩ

however

c1 == c2
res0: Boolean = false

so for the rescue, we need to normalize them first:

import java.text.Normalizer
val n1 = Normalizer.normalize(c1, Normalizer.Form.NFC)
val n2 = Normalizer.normalize(c2, Normalizer.Form.NFC)

n1 == n2
res1: Boolean = true

Simple!