Supporting Unicode

Part of preparing production-ready code is making sure that it behaves as expected for supported languages and inputs. In the previous chapter, we showed how easy Go makes many common string manipulation tasks. Many of these examples were implicitly English-centric, with only some hints that there may be more brewing below the surface when it comes to handling international character sets. In this chapter we will examine strings in greater depth, and learn how to write bug-free, production-ready code that handles strings in any language supported by the Unicode standard.

We will start by taking a detour through the history of string encodings. This will then inform the rest of our discussion on handling different character sets in Go.

A very brief history of string encodings

What are string encodings, and why do we need them? You can skip this section if you already know the difference between Unicode and UTF-8, and between a character and a Unicode code point.

Consider how a computer might represent a string of text. Because computers operate on binary, human-readable text needs to be represented as binary numbers in some way. Early computer pioneers came up with one such scheme, which they called ASCII (pronounced ASS-kee). ASCII is one way of mapping characters to numbers. For example, A is 65 (binary 0100 0001, or hexadecimal 0x41), B is 66, C is 67, and so on. We could represent the ASCII-encoded string “ABC” in hexadecimal notation, like so:

0x41 0x42 0x43

ASCII defines a mapping for 128 different characters (numbered 0 through 127), using exactly 7 bits. For the old 8-bit systems, this was perfect: every character fit in a single byte, with a bit to spare. The only problem is that ASCII only covers unaccented English letters.

As computers became more widespread, other countries also needed to represent their text in binary format, and unaccented English letters were not enough. So a plethora of new encodings was invented. Now when code encountered a string, it also needed to know which encoding the string was using in order to map the bytes to the correct human-readable characters.

Identifying this as a problem, a group called the Unicode consortium undertook the herculean task of assigning a number to every letter used in any language. Such a magic number is called a Unicode code point, and is represented by a U+ followed by a hexadecimal number. For example, the string “ABC” corresponds to these three Unicode code points:

U+0041 U+0042 U+0043

Notice how for the string “ABC”, the hexadecimal numbers are the same as for ASCII.

So Unicode assigns each character a number, but it does not specify how this number should be represented in binary. This is left to the encoding. The most popular encoding of the Unicode standard is called UTF-8. UTF-8 is popular because it has some nice properties.

One nice property of UTF-8 is that every code point between 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes. Because the first 128 Unicode code points were chosen to match ASCII, this has the side effect that English text looks exactly the same in UTF-8 as it did in ASCII. (Notice how the hexadecimally-encoded ASCII of “ABC” from earlier is the same as the Unicode code points for the same letters.)
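We can observe this property directly with Go's built-in len function, which counts bytes. The following is a small sketch; the byte counts follow from the UTF-8 encoding rules just described:

```go
package main

import "fmt"

func main() {
	// Code points 0-127 (plain ASCII letters) take one byte each.
	fmt.Println(len("ABC")) // 3 bytes for 3 characters

	// Code points above 127 need more bytes: each of these two
	// Chinese characters is encoded as three bytes in UTF-8.
	fmt.Println(len("你好")) // 6 bytes for 2 characters
}
```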

In this chapter we will keep things simple by focusing on only these two encodings: ASCII and UTF-8. UTF-8 has become the universal standard, and supports every language your application might need, from Chinese to Klingon. But the same principles apply for any encoding, and should your application need to handle the conversion from other encodings, most common encodings are available in the golang.org/x/text/encoding package.

For a more complete history of string encoding, we recommend Joel Spolsky’s excellent blog post from 2003, titled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

With an understanding of what string encodings are and why they exist, let’s now turn to how they are handled in Go.

Strings are byte slices

In Go, strings are read-only (or immutable) byte slices. The byte slice representing the string is not required to hold Unicode text, UTF-8 text, or any other specific encoding. In other words, strings can hold arbitrary bytes. For example, we can take a slice of bytes, and convert it to a string:

package main

import (
    "fmt"
)

func main() {
    b := []byte{65, 66, 67}
    s := string(b)
    fmt.Println(s)
    // Output: ABC
}

In this byte slice to string conversion, Go makes no assumptions about the encoding. To Go, this is just another byte slice, now with the type of string. When the string gets printed by fmt.Println, again, Go is just sending some bytes to standard output. The terminal that outputs the bytes to the screen needs to use the appropriate character encoding for the bytes to render correctly as human-readable text. In this case, either ASCII or UTF-8 encodings would do.

Printing strings

For debugging strings it is often useful to see the raw bytes in different forms. Verbs starting with a % symbol, like %s or %q, are placeholders for parameters when passed into certain functions in the fmt package, such as fmt.Printf, fmt.Sprintf (which returns a formatted string), and fmt.Scanf (which reads a variable from input). Package fmt implements formatted I/O with functions similar to C’s printf and scanf, though the format ‘verbs’ are simpler than C’s.

The next program showcases some different ways to print strings in Go using the fmt package.

package main

import (
    "fmt"
)

func showVerb(v, s string) {
    n := fmt.Sprintf(v, s)
    fmt.Println(v, "\t", n)
}

func main() {
    s := "ABC 你好"
    showVerb("%s", s)
    showVerb("%q", s)
    showVerb("%+q", s)
    showVerb("%x", s)
    showVerb("% x", s)
    showVerb("%# x", s)
}

Running this produces the following output:

%s  	 ABC 你好
%q  	 "ABC 你好"
%+q 	 "ABC \u4f60\u597d"
%x  	 41424320e4bda0e5a5bd
% x 	 41 42 43 20 e4 bd a0 e5 a5 bd
%# x	 0x41 0x42 0x43 0x20 0xe4 0xbd 0xa0 0xe5 0xa5 0xbd

The showVerb function prints the given verb v alongside the string s formatted using that verb. We use the fmt.Sprintf function to create a new string n formatted with the passed-in verb v, then print v and n together to see what the result looks like.

The verb following the % character specifies how Go should format the given parameter. There are many verbs to choose from, including some verbs for specific data types. The following verbs can be used for any data type:

%v    the value in a default format
      when printing structs, the plus flag (%+v) adds field names
%#v   a Go-syntax representation of the value
%T    a Go-syntax representation of the type of the value
%%    a literal percent sign; consumes no value

and these verbs are only for strings and slices of bytes:

%s    the uninterpreted bytes of the string or slice
%q    a double-quoted string safely escaped with Go syntax
%x    base 16, lower-case, two characters per byte
%X    base 16, upper-case, two characters per byte

Other common verbs include those for integers (e.g. %d for base 10 and %b for base 2) and booleans (%t). You can refer to the fmt package documentation for the full list.

As our example demonstrated, some verbs allow special flags between the % symbol and the verb. From the fmt package documentation:

+   always print a sign for numeric values;
    guarantee ASCII-only output for %q (%+q)
-   pad with spaces on the right rather than the left (left-justify the field)
#   alternate format: add leading 0 for octal (%#o), 0x for hex (%#x);
    0X for hex (%#X); suppress 0x for %p (%#p);
    for %q, print a raw (backquoted) string if strconv.CanBackquote
    returns true;
    always print a decimal point for %e, %E, %f, %F, %g and %G;
    do not remove trailing zeros for %g and %G;
    write e.g. U+0078 'x' if the character is printable for %U (%#U).
' ' (space) leave a space for elided sign in numbers (% d);
    put spaces between bytes printing strings or slices in hex (% x, % X)
0   pad with leading zeros rather than spaces;
    for numbers, this moves the padding after the sign

Flags are ignored by verbs that do not expect them.

There are plenty more options for formatting with the fmt package, and we recommend reading the documentation for a full breakdown of all the available options. One aspect not covered so far is that it is possible to specify a width with an optional decimal number immediately preceding the verb. By default the width is whatever is necessary to represent the value, but when provided, width will pad with spaces until there are at least that number of runes. This is not the same as in C, which counts the width in bytes. So what exactly are runes?

Runes and safely ranging over strings

Previously we stated that Go makes no assumptions about the string encoding when it converts bytes to the string type. This is true. The string type carries with it no information about its encoding. But in certain instances, Go does need to make assumptions about the underlying encoding. One such case is when we range over a string.

Let’s return to the last example from the previous chapter, and range over a string that contains some non-ASCII characters:

Iterating over the characters in a string
package main

import "fmt"

func main() {
	s := "ABC你好"
	for i, r := range s {
		fmt.Printf("%q(%d) ", r, i)
	}
	// Output: 'A'(0) 'B'(1) 'C'(2) '你'(3) '好'(6)
}

Running this, we get the following output:

1 'A'(0) 'B'(1) 'C'(2) '你'(3) '好'(6)

The range keyword returns two values on each iteration: an integer indicating the current position in the string, and the current rune. rune is a built-in type in Go meant to hold a single Unicode code point. It is an alias for int32, and thus occupies 4 bytes. So on every iteration, we get the current position in the string, in this case called i, and a rune called r. We use the %q and %d verbs to print out the current rune and position in the string.

By now it might be clear why the position variable i jumps from 3 to 6 between '你' and '好'. Under the hood, instead of advancing byte by byte, the range keyword fetches the next rune in the string, and i is the byte offset at which that rune starts. Since '你' occupies three bytes in UTF-8, '好' begins at offset 6.

Text Normalization

In the Unicode standard, there are often several ways to represent the same string. For example, the acute e in the word café can be represented in a string as a single rune ("\u00e9") or as an e followed by an acute accent ("e\u0301"). According to the standard, these two should be treated as equivalent. Is this what happens in Go?

Basic text normalization in Go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	a := string([]rune{0x00E9})
	b := string([]rune{0x0065, 0x0301})
	fmt.Println(a, b)
	fmt.Println("same:", a == b)

	normA := norm.NFC.String(a)
	normB := norm.NFC.String(b)
	fmt.Println(normA, normB)
	fmt.Println("same:", normA == normB)

	// Output: é é
	// same: false
	// é é
	// same: true
}

As the example demonstrates, the answer is “no”. String comparison, by default, does not consider the two versions of é equivalent. However, the example also shows how we can use the golang.org/x/text/unicode/norm package to convert the strings to NFC (“Normalization Form C”) before doing the comparison, in which case they do compare as equivalent. Unicode provides four normalization forms: NFD, NFC, NFKD and NFKC. These are summarized in the table below, taken from the Unicode standard documentation:

Form                            Description
Normalization Form D (NFD)      Canonical Decomposition
Normalization Form C (NFC)      Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD)    Compatibility Decomposition
Normalization Form KC (NFKC)    Compatibility Decomposition, followed by Canonical Composition

Here “decomposition” means writing the “e” and accent separately, as in "e\u0301", and “composition” means writing them as a single character, i.e. "\u00e9". The details of these are far beyond the scope of this book, but the important thing for our purposes is that there are several instances where normalizing text can be a good idea. We will discuss a few of these next.

Handling look-alikes

Some Unicode characters look exactly alike, yet use different underlying codes. This can pose a security threat. For example, can you tell the difference between users Kevin and Kevin? One Kevin is written with the Latin capital K (Unicode \u004B), while the other uses the Kelvin sign (Unicode \u212A), which looks identical. The second Kevin is likely doing this to circumvent detection mechanisms that check whether usernames are taken, and might intend to impersonate the first Kevin.

Fortunately, the golang.org/x/text/unicode/norm package comes to the rescue. We can use this to mitigate (but not completely eliminate) the risk. In the following example, we again use norm.NFC.String to convert the strings to canonical form before doing comparisons:

Comparing user names in canonical form
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func equalNorm(a, b string) bool {
	return norm.NFC.String(a) == norm.NFC.String(b)
}

func main() {
	user1 := "Kevin"      // uses the normal K, "\u004B"
	user2 := "\u212Aevin" // uses the Kelvin sign, "\u212A"
	fmt.Println("naive equal:", user1 == user2)
	fmt.Println("norm equal:", equalNorm(user1, user2))

	// Output: naive equal: false
	// norm equal: true
}

This trick can help us detect usernames that may be attempts to impersonate someone else, by treating visually similar names as equivalent. It’s not perfect, though: some characters that look the same are still treated as unequal. For example, the Latin o, Greek ο, and Cyrillic о remain distinct characters because they come from different alphabets, even though they look identical.
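A minimal sketch of this limitation, using only the standard library (the Cyrillic letter is written with an escape so the difference is visible in source):

```go
package main

import "fmt"

func main() {
	latin := "o"         // U+006F LATIN SMALL LETTER O
	cyrillic := "\u043e" // U+043E CYRILLIC SMALL LETTER O

	// These look identical, but have no canonical relationship in
	// Unicode, so even normalization would not make them equal.
	fmt.Println(latin, cyrillic)
	fmt.Println(latin == cyrillic) // false
}
```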

Saving bytes on the wire

Generally speaking, converting to the NFC form will not save all that much in terms of bytes stored in your database or sent over the wire. But in some languages, like Korean, the savings can be significant. You should benchmark with your own application’s data, but it is well worth considering converting text to NFC form before storing it in a database or sending it to other applications.
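As a rough sketch of why Korean text benefits, compare the byte counts of a precomposed Hangul syllable and the same syllable spelled out as individual jamo (these literals correspond to the NFC and NFD forms of the syllable, respectively):

```go
package main

import "fmt"

func main() {
	// The syllable 한 as one precomposed code point (its NFC form).
	composed := "\ud55c"
	// The same syllable as three separate jamo (its NFD form).
	decomposed := "\u1112\u1161\u11ab"

	fmt.Println(composed, decomposed)
	fmt.Println(len(composed))   // 3 bytes
	fmt.Println(len(decomposed)) // 9 bytes
}
```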

Correctly modifying Unicode text

The norm package can also be of use when replacing text. Consider an example, adapted from the official Go blog post, where we wish to replace the word cafe in some text. Without taking multi-rune character boundaries into account, replacing cafe with cafes in text that ends with an e followed by a combining accent ("cafe\u0301") yields cafeś, with the accent now incorrectly attached to the s. We can handle this by transforming the text to canonical form before doing the replacement.

Replacing text while respecting boundaries between multi-rune characters
package main

import (
	"fmt"
	"strings"

	"golang.org/x/text/unicode/norm"
)

func replaceNormalized(text string, old, new []string) string {
	for i := 0; i < len(old) && i < len(new); i++ {
		text = strings.ReplaceAll(norm.NFC.String(text), old[i], new[i])
	}
	return text
}

func main() {
	// The accented e is in decomposed form.
	text := "We went to eat at multiple cafe" + string(rune(0x0301))
	incorrect := strings.ReplaceAll(text, "cafe", "cafes")
	fmt.Println("Incorrect:", incorrect)

	correct := replaceNormalized(text, []string{"cafe", "café"}, []string{"cafes", "cafés"})
	fmt.Println("Correct:", correct)

	// Output: Incorrect: We went to eat at multiple cafeś
	// Correct: We went to eat at multiple cafés
}

Summary

In this chapter we covered:

  • How Go treats Unicode characters in strings differently from many other languages. Be sure to test with Unicode characters when ranging over strings or counting characters, unless your input is guaranteed to contain only characters from the ASCII subset.
  • Text normalization and handling look-alike characters, as well as some other Unicode-related edge cases to look out for when writing production-ready Go code.

If you would like to learn more about this topic, also see the Go blog posts Strings, bytes, runes and characters in Go and Text Normalization in Go.