ASCII String to Lowercase in Go
As the documentation states, Go's strings
package is for the manipulation of UTF-8 encoded strings. This might make one wonder how an ASCII-specific implementation would compare.
To start with, let's write an ASCII-only lowercase
function. This will iterate through the input, converting an uppercase ASCII letter (where the value is between 'A' and 'Z') and lowercasing it by adding 32 (or 'a' - 'A') to it:
func lowercase(input string) string {
lower := make([]byte, len(input))
for i := 0; i < len(input); i++ {
c := input[i]
if 'A' <= c && c <= 'Z' {
c += 32
}
lower[i] = c
}
// if you think this is unfair, note that strings.Builder
// does the exact same thing
return *(*string)(unsafe.Pointer(&lower))
}
We can tell from our lowercase
function that the performance will be linear based on the length of the input and regardless of what the input is (whether it be all lowercase or all uppercase or a mix of the two). But without looking into strings.ToLower
we don't know how it behaves across different lengths and types of inputs, so we'll test 4 different strings:
var A = strings.Repeat("A", 4096)
var a = strings.ToLower(A)
var cases = []struct {
input string
expected string
name string
}{
{"jpeg", "jpeg", "short no transform"},
{"JPEG", "jpeg", "short with transform"},
{a, a, "long no transform"},
{A, a, "long with transform"},
}
Above we have the test cases for a short and long string, which both come in an already-lowercase version and an uppercase version. There's nothing fancy about our benchmark:
func Benchmark_Stdlib(b *testing.B) {
for _, c := range cases {
b.Run(fmt.Sprintf("stdlib_%s", c.name), func(b *testing.B) {
for i := 0; i < b.N; i++ {
if actual := strings.ToLower(c.input); actual != c.expected {
panic(actual)
}
}
})
}
}
func Benchmark_Custom(b *testing.B) {
for _, c := range cases {
b.Run(fmt.Sprintf("custom_%s", c.name), func(b *testing.B) {
for i := 0; i < b.N; i++ {
if actual := lowercase(c.input); actual != c.expected {
panic(actual)
}
}
})
}
}
And the results:
stdlib_short_no_transform 6.303 ns/op 0 B/op 0 allocs/op
custom_short_no_transform 11.39 ns/op 4 B/op 1 allocs/op
stdlib_short_with_transform 19.98 ns/op 4 B/op 1 allocs/op
custom_short_with_transform 11.39 ns/op 4 B/op 1 allocs/op
stdlib_long_no_transform 2837 ns/op 0 B/op 0 allocs/op
custom_long_no_transform 2450 ns/op 4096 B/op 1 allocs/op
stdlib_long_with_transform 10731 ns/op 4096 B/op 1 allocs/op
custom_long_with_transform 2457 ns/op 4096 B/op 1 allocs/op
With no insight into the standard library's implementation, these results are quite interesting. The standard library is much faster with short strings which are already lowercase, but much slower for long strings that must be converted. In the other two cases, it's a little closer, but still the edge goes to our custom function.
As we predicted, the performance (and memory) characteristic of our function is easy to reason about: the two short versions perform the same as each other, as do the two long versions. The performance of strings.ToLower
is a little bit more varied. In particular, we see that the variations that require no conversion perform considerably better than the variations that do. Why is that? Well, one big hint is that both fast cases allocate no memory.
It turns out that strings.ToLower
first scans the entire input to determine two things: 1) does the string contain only ASCII characters and 2) does it contain any uppercase characters. Understanding this detail explains all of the performance differences between the two versions. First of all, the standard library starts off with a performance disadvantage by needing to scan the entire string once (which explains why the performance for long strings is so much worse). However, by doing this scan, it can detect that the string is already lowercase and simply return the input (which explains why the tests with lowercase input is so fast and doesn't allocate any memory).
The logic for lowercasing an ASCII string is much simpler and efficient than an UTF-8 string. But because strings.ToLower
supports UTF-8, the only solution is to scan the string to determine whether the faster ASCII logic can be used. Cleverly, since you're scanning the string anyways, you might as well check if anything even needs to change. The issue that resulted in this change can be seen here.
Our story isn't quite done yet. Knowing what the standard library does, is there anything we can do to improve the performance of our custom function, specifically with respect to our worst case performance: short strings that are already lowercase? At first glance, it's hard to see a solution other than what the standard library does, namely scanning the string first and then scanning it again, if necessary, to lowercase it. But upon further reflection, the two loops can be combined into one. That is, we can scan the input up until we detect our first uppercase character, at which point we can begin our conversion within the same loop
func fancyLowercase(input string) string {
for i := 0; i < len(input); i++ {
c := input[i]
if 'A' <= c && c <= 'Z' {
// We've found an uppercase character, we'll need to convert this string
lower := make([]byte, len(input))
// copy everything we've skipped over up to this point
copy(lower, input[:i])
// our current character needs to be uppercase (it's the reason we're
// in this branch)
lower[i] = c + 32
// now iterate over the rest of the input, from where we are, knowing that
// we need to copy/lower case into our lowercase strinr
for j := i + 1; j < len(input); j++ {
c := input[j]
if 'A' <= c && c <= 'Z' {
c += 32
}
lower[j] = c
}
// if you think this is unfair, note that strings.Builder
// does the exact same thing
return *(*string)(unsafe.Pointer(&lower))
}
}
// input was already lowercase, return it as-is
return input
}
And the results:
stdlib_short_no_transform 6.305 ns/op 0 B/op 0 allocs/op
custom_short_no_transform 11.40 ns/op 4 B/op 1 allocs/op
fancy_short_no_transform 5.885 ns/op 0 B/op 0 allocs/op
stdlib_short_with_transform 20.45 ns/op 4 B/op 1 allocs/op
custom_short_with_transform 11.39 ns/op 4 B/op 1 allocs/op
fancy_short_with_transform 12.24 ns/op 4 B/op 1 allocs/op
stdlib_long_no_transform 2884 ns/op 0 B/op 0 allocs/op
custom_long_no_transform 2441 ns/op 4096 B/op 1 allocs/op
fancy_long_no_transform 1646 ns/op 0 B/op 0 allocs/op
stdlib_long_with_transform 10781 ns/op 4096 B/op 1 allocs/op
custom_long_with_transform 2448 ns/op 4096 B/op 1 allocs/op
fancy_long_with_transform 2754 ns/op 4096 B/op 1 allocs/op
Our new implementation is a good mix of the two other versions. Like strings.ToLower
it's able to avoid unnecessary allocations for input which are already lowercase. And, like our custom lowercase
function, it only ever does a single iteration. We're always faster than strings.ToLower
(because we're able to assume the string is ASCII) and when a conversion is needed, we're only a little bit slower than our simpler lowecase
version.
Additional Notes
The above is all I wanted to say about lowercasing, specifically because I was looking at ASCII-only text manipulation. But I did notice two small details in the standard library that I wanted to point out. The first relates to UTF-8 conversion using strings.ToLower
. If we know that our input is UTF-8, then again, strings.ToLower
is doing an unnecessary iteration over our input (to determine if it's ASCII only):
var input = strings.Repeat("A", 4096) + "☺"
var expected = strings.ToLower(input)
func Benchmark_ToLower(b *testing.B) {
for i := 0; i < b.N; i++ {
if actual := strings.ToLower(input); actual != expected {
panic(actual)
}
}
}
func Benchmark_Map(b *testing.B) {
for i := 0; i < b.N; i++ {
if actual := strings.Map(unicode.ToLower, input); actual != expected {
panic(actual)
}
}
}
When strings.ToLower
detects a UTF-8 input, it calls strings.Map(unicode.ToLower)
. The result of calling this directly is obviously going to be faster (note that our input is intentionally exploiting the worst-case behaviour):
ToLower 11208 ns/op 4864 B/op 1 allocs/op
Map 8985 ns/op 4864 B/op 1 allocs/op
It's a small difference, but it might be of interest in cases where UTF-8 input is guaranteed.
The other small/random thing I observed is that there's a ToLower
function inside the net/http/internal/ascii
package which first iterates through the string to determine if each character is printable. If each character is printable, strings.ToLower
is called. This means that this function iterates over the input string 3 times. That seems unfortunate, though I assume it's only used for short strings.