Scala: String Interpolation Performance
Performance comparison between different kinds of string concatenation/formatting in Java/Scala
UPD: Here is the Pull Request to scala-compiler with changes inspired by this post.
String concatenation is a basic building block in every modern programming language. Many different projects, especially in Web, produce a lot of strings. And it’s interesting, how this problem is solved in different languages. In our case, in Java and Scala (because it’s much easier to compare, and these I use day-to-day).
There are a lot of posts on Internet about this topic. But still, there are something interesting to say about it (I hope you won’t be bored). Yes, this is about performance and microbenchmarking, I know.
Basically, this post is about how to make robust string concatenation (with macros!) and performance comparison. You may skip this big part right to Graphs and Conclusion.
What is string concatenation?
In JVM String object is immutable. If you need to create a new String with a content of two other strings, you need to create a third one. In Java you may concatenate not only string but any objects — Java compiler will convert an object to its string representation automatically. Scala compiler also does it.
What will compiler do, when we write such code in Java:
int a = 1;
String b = "s";
String str = a + b;
This code will be converted to this code:
int a = 1;
String b = "s";
String str = new StringBuilder()
.append(a)
.append(b)
.toString();
Nothing new, it’s well-known thing. But, it’s important to understand, what is happening there:
- new StringBuilder creates a class with a char array of 16 elements.
- append method converts an argument to chars and stores into an internal array. If an array is not big enough, it will create a bigger array.
- toString creates a String object, but char array is not shared between StringBuilder and this new String object (because it’s mutable in StringBuilder).
So, in order to concatenate 2 objects, we need to create an intermediate object, big enough to store the string representation of our 2 objects.
How to improve default concatenation
We have 2 heavy operations from the list above: allocation of a buffer in StringBuilder (and reallocations!) and a creation of String (making copy of StringBuilder’s buffer).
Let’s put some efforts to deal with it. Because I had a lot of experience in .NET, I know, how string concatenation is implemented there.
What’s about C#.NET?
In C# the same problem solved in a slightly different way. Actually, many problems in .NET solved slightly differently.
This concatenation will be converted by the compiler to call to static method Concat (yes, in .NET they use CamelCase instead of camelCase in Java, yet another little difference):
String str = String.Concat(a, b);
And inside Concat we may see, how it works:
public static String Concat(String str0, String str1) {
if (IsNullOrEmpty(str0)) {
if (IsNullOrEmpty(str1)) {
return String.Empty;
}
return str1;
} if (IsNullOrEmpty(str1)) {
return str0;
} int str0Length = str0.Length;
String result = FastAllocateString(str0.Length + str1.Length);
FillStringChecked(result, 0, str0);
FillStringChecked(result, str0Length, str1);
return result;
}
What is interesting here is resulting string length precomputation. They create a char buffer of resulting size, fill it and use in String object without copying. Obviously, this should work much faster (and it’s).
Let’s try to do something similar in Scala.
Optimized string concatenation
So, what do we need is precompute a StringBuilder length to avoid any reallocations.
def concat(o1: Object, o2: Object, o3: Object,
o4: Object, o5: Object, o6: Object): String = {
concatNonNull(orNull(o1), orNull(o2), orNull(o3),
orNull(o4), orNull(o5), orNull(o6))
}private def orNull(o: Object): String =
if (o == null) "null" else o.toStringprivate def concatNonNull(s1: String, s2: String,
s3: String, s4: String,
s5: String, s6: String): String = {
new StringBuilder(s1.length + s2.length + s3.length + s4.length + s5.length + s6.length)
.append(s1)
.append(s2)
.append(s3)
.append(s4)
.append(s5)
.append(s6)
.toString
}
This is a version for 6 arguments (we may generate as many as we want). Precalculation of a StringBuilder buffer size should give us some performance boost (we will see it later).
Fast StringBuilder.toString
Let’s take a look at StringBuilder.toString implementation:
@Override
public String toString() {
// Create a copy, don't share the array
return new String(value, 0, count);
}
And in String’s constructor we see:
public String(char value[], int offset, int count) {
// ... validations omitted
this.value = Arrays.copyOfRange(value, offset, offset+count);
}
Aha! Arrays.copyOfRange creates a new char[] and copy data there. We know for sure, that char array will be filled up and we can use it directly (we’re good ppl, we won’t modify it after string creation). In String class, there is a constructor, that don’t do copy, but… it’s package-private. Meaning, we can’t use it. Or we can?
We may try to use fast reflection (MethodHandle) to access to private constructor! Looks pretty easy:
public static String fastNewString(char[] chars) throws Throwable {
Constructor<String> constructor = String.class
.getDeclaredConstructor(char[].class, boolean.class);
constructor.setAccessible(true);
// it should be cached
MethodHandle newString = MethodHandles.lookup()
.unreflectConstructor(constructor);
return (String) newString.invokeExact(chars, true);
}
Spoiler: it doesn’t work! Details below.
That’s it?
So, that’s it, no more optimizations? Of course not! What else to we do in concatenation? The toString method calls for each argument. This also could be optimized. How?
Imagine, you pass an integer there, which is primitive. In order to pass it, you need to box it and then we call Integer.toString method. What’s wrong with toString? It allocates new memory. StringBuilder has an optimization for appending primitive values, it calculates how many chars needed to print an integer (in order to reallocate internal buffer, if necessary) and then uses internal method Integer.getChars, passing internal char array there. So, there are no additional allocations for printing integer.
Having such generic method, we can’t utilize StringBuilder power to not do unnecessary allocations. But, we’re in Scala world, we may use macroses! The very good explanation is in this StackOverflow topic.
Scala String Interpolation
But before I move to macros implementation of string concatenation, let’s take a look at Scala’s way of string formatting which is called String Interpolation:
val value1 = 1
val value2 = "abc"s"start${value1}middle${value2}end"
This is an equivalent of Java of Scala simple string concatenation, but looks much better:
"start" + value1 + "middle" + value2 + "end"
Scala compiler transforms such interpolation to this code:
new StringContext("start", "middle", "end").s(value1, value2)
Oh no. Again, more allocations? Yes. At least 2: one for an array of string constants and another for an array of arguments. And also, inside method s, there is a processing of string constants (yes, on each call). Doesn’t look as highly optimal.
Macros to rule them all
We may create a macros, that will generate another code for string interpolation. We still can use this neat syntax, but do whatever we want to concatenate these string constants and arguments.
For example, we can precalculate length of StringBuilder, and then append arguments there and be sure, that there won’t be any other allocation.
The code of macros is relatively small (around 100 lines), but it’s hard to demonstrate it well. I will explain what it does. This code:
def value1: Int
def value2: Objectsfi"a${value1}b${value2}c"
Will be transformed to this code:
val local_2 = {
val tmp = value2
if (tmp eq null) "null" else tmp.toString
}
new java.lang.StringBuilder(12 + local_2.length)
.append("a")
.append(value1)
.append("b")
.append(local_2)
.append("c")
.toString
Important parts of generated code:
- sfi stands for “super fast interpolation” :)
- Precalculated length constant — 12 — consists of 3 characters for strings “a”, “b” and “c” and 9 characters allocated for int (value1). It may be not enough (for bigger int), but this is a simplification for now.
- We don’t box integer and don’t call toString, but using append method overload for int. It should be faster.
Obviously, macros could be much richer — we can calculate exactly length of integer (and all primitive values) to not over/under allocate buffer. We may determine whether it’s safe to not create local variables (if it’s already local variable or field) to reduce bytecode size.
But, for testing purposes, I think it’s good enough.
Let’s microbenchmark it!
Basically, I will test this case:
public static String concat(int value1, String value2,
Object nullObject) {
return value1 + "a" + value2 + "b" + value2 + nullObject;
}
With various input arguments, resulting string will be of 7, 17, 29, 75, 212, 1004 and 1006 characters long, which should cover a lot of use cases.
What to test
Besides various input arguments there will be benchmarks for different kinds of concatenation/formatting (all listed in this test):
- Java concatenation, as in code fragment above;
- String.format method;
- java.text.MessageFormat;
- slf4j message formatter (just because it’s a popular logging framework);
- Scala concatenation;
- s string interpolator;
- f string interpolator;
- raw string interpolator;
- optimized concatenation 1 (with length precalculated);
- optimized concatenation 2 (without copying of char array);
- optimized concatenation via string interpolation + macros (soInterpolator);
- super fast interpolation via macros (sfiInterpolator).
Charts and tables
As you can see, optimized concatenation works better on strings longer than 16 characters. String interpolations, based on macros (so and sfi), are the best. Scala’s interpolations (s, f and raw) are slow, its performance lower than regular string concatenation and approximately the same as slf4j.
The more convenient chart is here.
Data table for a resulting string length 212:
Benchmark Score Error Units
javaConcat 165.320 ± 0.578 ns/op
stringFormat 1364.206 ± 19.899 ns/op
messageFormat 1370.831 ± 11.360 ns/op
scalaConcat 169.005 ± 1.481 ns/op
concatOptimized1 94.336 ± 1.419 ns/op
concatOptimized2 131.683 ± 4.288 ns/op
concatOptimizedMacros 68.702 ± 3.256 ns/op
slf4j 275.377 ± 10.165 ns/op
sInterpolator 268.072 ± 12.385 ns/op
fInterpolator 1375.487 ± 10.255 ns/op
rawInterpolator 271.416 ± 1.797 ns/op
sfiInterpolator 58.674 ± 0.562 ns/op
Curious facts
s interpolator is slow
For me it was unexpected. I thought, such generic thing should work as fast as possible, but…
Now, after my research, I understand, why it works so poorly. But I don’t understand, why it wasn’t implemented on compiler level, or at least with macros.
Additionally, I’ve performed couple more tests:
class EmptyStringBenchmark extends BenchmarkBase {
@Benchmark
def baseline: String = {
""
}
@Benchmark
def sInterpolator: String = {
s""
}
@Benchmark
def sfiInterpolator: String = {
import com.komanov.stringformat.macros.MacroConcat._
sfi""
}
}
class ConstStringBenchmark extends BenchmarkBase {
@Benchmark
def baseline: String = {
"abc"
}
@Benchmark
def sInterpolator: String = {
s"abc"
}
@Benchmark
def sfiInterpolator: String = {
import com.komanov.stringformat.macros.MacroConcat._
sfi"abc"
}
}
And the results:
Benchmark Score Error Units
EmptyStringBenchmark.baseline 2.899 ± 0.011 ns/op
EmptyStringBenchmark.sInterpolator 36.181 ± 0.083 ns/op
EmptyStringBenchmark.sfiInterpolator 2.895 ± 0.009 ns/op
ConstStringBenchmark.baseline 2.897 ± 0.008 ns/op
ConstStringBenchmark.sInterpolator 48.307 ± 0.112 ns/op
ConstStringBenchmark.sfiInterpolator 2.892 ± 0.006 ns/op
Besides allocations of arrays, sInterpolator also processes constant strings (non-arguments) to replace special characters. So, for constant string s“a\tb” it will be even worse.
Macros do that processing during compilation time.
Other Scala interpolators
According to documentation, fInterpolator is based on String.format. And its performance actually correlates.
rawInterpolator don’t do any string processing, that’s why it’s slightly better than sInterpolator.
MethodHandle trick is not working
At the beginning, I’ve trusted to StackOverflow response about the performance of MethodHandle. But the only thing I noticed is a performance degradation between concatOptimized1 and concatOptimized2 (StringBuilder.toString won over new String(shared_char_array)).
So, I’ve tested MethodHandle performance by myself, and I got “good” results:
Benchmark Score Error Units
baseline 2.907 ± 0.021 ns/op
fastSb 8.213 ± 0.088 ns/op
fastString 5.129 ± 0.568 ns/op
newString 246.314 ± 3.909 ns/op
sbToString 253.588 ± 7.551 ns/op
fastString is an invocation of private constructor of String and newString is an invocation of public constructor of String, that copies input char array.
But when I tested it on code like:
val sb = new StringBuilder(len1 + len2)
.append(...)
.append(...)
FastStringFactory.fastNewString(sb)
I received performance degradation. Apparently, JVM doesn’t optimized it in this case. Sadly.
But there is a lesson learned: don’t trust in microbenchmarks — verify!
Scala’s StringBuilder is slower
I even wrote a small post about it. Despite the fact, that it’s a wrapper of java.lang.StringBuilder, it’s significantly slower.
Btw, inside sInterpolator Java’s StringBuilder is used. Apparently, they knew it! :)
“They” know that java.lang.StringBuilder is better
Scala concatenation (+ operator) works at the same performance as Java concatenation since 2.12.
And also sInterpolator uses java.lang.StringBuilder inside.
Better performance as I expected for sfiInterpolator
Inside sfiInterpolator implementation, I’ve decided to make it simple, and I am not calculating the precise length for integer and just preallocate 9 characters. And I’ve made a single test with Int.MAX_VALUE, which requires 10 characters. I expected some significant performance degradation since we need to reallocate buffer from 1004 to 2008 chars (yes, I actually verified, that this reallocation happened). But, according to benchmarks, there wasn’t any degradation. Which is really strange.
Conclusion
Performance optimization is always a fun journey (at least for me): you read code, think about it, try to write an alternative solution, measure, analyze. But, it’s not fun to realize, that such basic things aren’t optimized as it could be. I don’t know the motivation (maybe just lack of time), but I’m disappointed a bit.
Anyway, Scala has a powerful tool (macros), and it could be fixed.