I’m slowly picking up Go, and ran across a Hacker Rank task: count the number of uppercase letters, in an input, which is written in CamelCase. For example, thisStringWouldYield 3.
It’s easy as it sounds: you go through each character and increase a counter if it’s in uppercase. (playground)
func camelCount(in string) int {
counter := 0
for _, ch := range in {
if (unicode.IsUpper(ch)){
counter++
}
}
return counter
}
… but as regexes, exiting Vim, and character encoding are often a weak point of
many of us, I’ve decided to dig deeper, and see how Go implements IsUpper
. It turned out
to be a good refresher about some basics, which often get overlooked or are taken for granted.
Understanding the problem
Let me throw in a couple of words, and see how many of them we know, and how many of them we think we know: ASCII, ANSI, Unicode, encoding, UTF8, UTF16, code point, rune. I will give a brief description below, but reading Joel’s article is definitely a better way to learn about them:
- ASCII: a set of 128 characters, consisting of the English alphabet, control characters, numbers, punctation marks. A character maps directly to a value in memory:
F -> 01000110 (70)
. Uses 1 byte (8 bits), so this leaves 128 bits up for grabs. This is where ANSI comes in. - ANSI: a set of 256 characters, where the first 128 are from ASCII, and the second part was used for special characters like
ő
andÞ
. As you might imagine, there are more than 128 characters other than English ones, so the upper 128 slots were divided into code pages (Arabic, Greek, etc.). - Unicode: a successful attempt to unify all possible characters in a single character set. The code
U+0048
points to letterH
- we call this acode point
. This is then encoded to memory:H -> U+0048 -utf8-> 110000 (48)
. - encoding: storing a code point in memory. There are hundreds of types, but most of them can represent only a portion of the Unicode character set. This is where UTF-8 comes in.
- UTF-8: an encoding format for the whole Unicode character set, which stores the first 127 characters in 1 byte (8 bits). This makes it compatible with ASCII and ANSI up to the first 127 characters. Characters above this are stored in 2 to 6 bytes.
Back to the task at hand of checking if a character is upper case. “A” is for example encoded to the decimal 41, where lowercase “a” is encoded to 61. At a first glance, it seems like there’s a formula of some sorts to determine the character’s case. The answer is more simple, and more complicated at the same time.
We don’t need any fancy formula, because everything is hardcoded into programming language. There’s a huge hardcoded table of characters, from which we get what we need, and continue with our day.
The “get what we need” is where it gets interesting. Let’s dive into the internals of Go.
I will break down the code examples as much as I can, to assure a firm base for understanding the implementation. This will
include some Go and general programming concepts. Let’s see how isUpper()
works under the hood.
Looking under isUpper()
The implementation is suspiciously short:
func IsUpper(r rune) bool {
if uint32(r) <= MaxLatin1 {
return properties[uint8(r)] & pLmask == pLu
}
return isExcludingLatin(Upper, r)
}
We will focus on understanding line 3:
return properties[uint8(r)] & pLmask == pLu
We’re accessing an array called properties
with an 8-bit integer index, derived from the rune (code point), have some kind of a mask, and pLu
whatever that is. Making sense of both sides of the equality check is up next.
Flags and bitmasks
If we look up the definition of pLmask
(src/unicode/graphic.go), your first reaction might be to switch to the YouTube tab you have sitting next to
the current one, but let’s fight that urge and see what we have here.
const (
pC = 1 << iota // a control character.
...
pLu // an upper-case letter.
pLl // a lower-case letter.
...
pLmask = pLu | pLl // a letter that is neither upper nor lower case.
)
Yes, this is how the Pandora’s box looks like. If you’re fluent in Go and bitwise operations, switch to that YouTube tab until the rest of us finish here.
Q: Why aren’t all the elements defined, if these are constants?
A: They are. The missing type and value just means to repeat the line above as long as it doesn’t encounter an equals sign. A property of Go’s iota is to auto-increment as it moves through the list with each new line.
const (
a = 0 + iota # => 0 + 0 = 0
b # => 0 + 1 = 1
c # => 0 + 2 = 2
)
Q: tries to skim over pC = 1 << iota
A: The elephant at the top, known as a left bitwise shift, is one of the rarer operations you’ll see in your everyday life, and it’s like seeing a full kitchen sink - you reflexively look away. Bitwise operations are usually used to represent flags. Pseudo code, describing ships and a vacation coming up; we’ve arrived to 0’s and 1’s, but it’s fun I promise.
Let’s take the following flags, to help us define our vacationCriteria
:
slipperyDeck = 1 # binary: 0001
onboardPool = 2 # binary: 0010
captainOnBoard = 4 # binary: 0100
freeSnacks = 8 # binary: 1000
The values are not arbitrary: notice that the 1
in the binary representation of the flag is always on a different place, and it never clashes with another one.
If we were looking for a ship with a swimming pool, and we’re into slippery decks, we combine onboardPool
and slipperyDeck
:
vacationCriteria = slipperyDeck | onboardPool # which is equal to: 0001 | 0010 = 0011
Flags are combined with the logical OR
operation, represented with |
. Also our vacationCriteria
is what is called a bitmask. We use bitmasks, to see if a flag is set or not, by using the logical AND
(&
) operation.
Often, this is called extracting the flag
.
Now, we found’ve a potential ship. Let’s check if it fits our needs:
ship = 15 # binary: 1111, pretty flashed out ship
packBags() if (ship & vacationCriteria) # 1111 & 0011 = 0011, which is 'true' so let's packBags()
Conveniently, it also has a captainOnBoard
and freeSnacks
.
When defining the flags, to conserve our brain cycles from moving the binary 1
without clashing with other flags, we use a bitwise left shift <<
. It does exactly we’ve done above when we defined the set of flags.
Combined with iota
(which auto-increments with each line), we always get a new variable, which exactly avoids clashing with the previous one. In other words, with each new line, the 1
in the binary representation of the value, is always one position further from the end than in the previous line.
An interesting property of the left bitwise shift operation is that the end result of `x << y` can be calculated with `x * 2^y`.
Back to the flag definitions in unicode/graphic.go
.
const (
_ = 1 << iota // 1 << 0 = 00001 << 0 => 00000001 (1 * 2^0 = 1)
... // four other flags defined here
pLl // 1 << 5 = 00001 << 5 => 00100000 (1 * 2^5 = 32)
pLu // 1 << 6 = 00001 << 6 => 01000000 (1 * 2^6 = 64)
...
pLmask = pLu | pLl // => 01100000 (32 + 64 = 96)
)
At this point, we’re masters of this snippet.
To make sense of the variable names here: the capital letter refers to Unicode categories: Letter, Character, Space, Other. The ending lowercase letter is, as you probably already guessed, referring to lowercase
or uppercase
.
So pLmask
is a bitmask, which has the uppercase and lowercase flags set to true. By applying a bitwise &
and a character’s properties
,
we can check if the character has either the lowercase or uppercase flag set.
Making sense of it all
We’re now armed to decypher the line from which we started, in 3 quick strokes.
properties[r] & pLmask == pLu
-
properties
is an array which stores flags, like ourvacationCriteria
from before. For example at index 41, you’ll find:properties = [MaxLatin1 + 1]uint8 { ... 0x41: pLu | pp, // 'A' ... }
Conveniently, the position corresponds to the character’s number representation in the Unicode table. With
properties[uint(A)]
we get the value of thepLu | pp
flags.pp
means “printable” and it doesn’t interfere with our check forpLu
, as it’s not present inpLmask
.properties[uint("A")] = pLu | pp // 01000000 | 10000000 => 11000000
So based on the above, we know that the capital “A” is printable and that it’s uppercase. In the sections below, instead of
properties[]
, we will write out(pLu | pp)
, so that the logical operations are more clear. -
The next thing in line is
& pLmask
. With it we check if the value of the givenproperties
element, has thepLmask
flags activated, like we checked if a ship is suitable for our vacation.(pLu | pp) & pLmask // 11000000 & 01100000 => 010000000
Again,
(pLu | pp)
is derived from step 1, and it’s the value ofproperties[uint("A")]
. -
Lastly, we check if the extracted value is exactly
pLu
ie. is the character exactly upcase. It can’t bepLu
andpLl
at the same time. That would make a strange letter.((pLu | pp) & pLmask) == pLu // 010000000 == 010000000, true
Now we can answer the question: is the letter A uppercase?
If a letter has the 'pLu' flag set, it is an uppercase letter.