Benchmarking batch JDBC queries

One of our services recently started to perform multiple inserts and deletes in MySQL database. To the point of a noticeable response time increase. Batching of SQL queries is nothing new, but I decided to wander around this topic a bit in Internet, and stumbled upon something I never heard before (or blissfully forgot). A rewriteBatchedStatements property of MySQL JDBC driver. Here I am, benchmarking this thing and checking other options…

Disclaimer 1. To be honest, this post doesn’t contain anything new or special in it. I’ve just discovered something interesting and would like to share it.

Disclaimer 2. This time I haven’t prepared standalone code example, as I did it for a specific task inside of our bazel monorepo using some of our internal tooling. Sorry. But there’s nothing special in those benchmarks.

What happened?

Initially we used queries like this (all examples are for scalikejdbc but it doesn’t really matter):

case class Entity(id: String, val1: String, val2: Boolean, val3: Array[Byte]) def add(entity: Entry): Unit = autoCommit { implicit session =>
sql"""
INSERT INTO tbl (id, col1, col2, col3)
VALUES (${entity.id}, ${entity.val1}, ${entity.val2}, ${entity.val3})
"""
.update()
.apply()
}
def remove(id: String): Unit = autoCommit { implicit session =>
sql"""DELETE FROM tbl WHERE id = $id"""
.update()
.apply()
}

As system evolved, we started calling those DAO methods in a loop (gotcha!). Eventually number of entities grew, and we got a problem :)

Batching

Batching is an attempt to reduce number of round-trips from application to a database. So, what driver does is — instead of sending a bunch of separate queries, it sends it this way:

INSERT INTO tbl (...) VALUES(...);
INSERT INTO tbl (...) VALUES(...);

Note the ; at the end of the query. Those are still separate queries, but it's sent as a single request. To code change is not that big:

def add(entities: Seq[Entity]): Unit = localTx { implicit session =>
val params = entities.map { entity =>
Seq(
"id" -> entity.id,
"val1" -> entity.val1,
"val2" -> entity.val2,
"val3" -> entity.val3,
)
}
sql"""
INSERT INTO tbl (id, col1, col2, col3)
VALUES ({id}, {val1}, {val2}, {val3})
"""
.batchByName(params: _*)
.apply()
}

Similar thing we can do for DELETE:

def remove(ids: Seq[String]): Unit = localTx { implicit session =>
val params = ids.map(v => Seq(v))
sql"""DELETE FROM tbl WHERE id = ?"""
.batch(params: _*)
.apply()
}

Better DELETE (IN clause)

Actually, for delete we can do much more robust thing:

def remove(ids: Seq[String]): Unit = autoCommit { implicit session =>
val inClause = SQLSyntax.csv(ids.map(id => sqls"""$id"""): _*)
sql"""DELETE FROM tbl WHERE id IN ($inClause)"""
.update()
.apply()
}

rewriteBatchedStatements

I stumbled upon this parameter on StackOverflow and instantly decided to check how it works. Basically, it rewrites INSERT queries in multi-value queries:

INSERT INTO tbl (...)
VALUES (...), (...), (...)

Which makes it slightly more concise (less SQL) and something inside MySQL makes it faster.

Benchmarks

First, I verified that rewriteBatchedStatements actually works by enabling profile logs in driver, it showed that final queries. Second, I built a very simple benchmark that tests different flavors of batching for INSERT and for DELETE with different number of entries. Each entry in my test is about few hundreds of bytes, multiple columns, nothing fancy.

INSERT

As you may see in benchmark results below, multi-value INSERT is really the fastest.

oneByOne is just a loop outside the DAO. Total time increases linearly (obviously). The problem here is simple -- the number of transactions in MySQL equals to number of rows that we're inserting. And a transaction takes time.

Solution to this is to start transaction only once and then make queries in the loop within this transaction. This case represented by oneByOneInTransaction in the table. As you may see, its performance is comparable to a regular batching.

Benchmark    (numberOfEntities)  Mode  Cnt     Score   Error  Unitsbatch                         1  avgt    2    12.797          ms/op
batchRewriting 1 avgt 2 11.769 ms/op
oneByOneInTransaction 1 avgt 2 12.624 ms/op
oneByOne 1 avgt 2 12.184 ms/op
batch 10 avgt 2 13.433 ms/op
batchRewriting 10 avgt 2 11.835 ms/op
oneByOneInTransaction 10 avgt 2 15.592 ms/op
oneByOne 10 avgt 2 125.161 ms/op
batch 100 avgt 2 29.763 ms/op
batchRewriting 100 avgt 2 22.480 ms/op
oneByOneInTransaction 100 avgt 2 35.664 ms/op
oneByOne 100 avgt 2 1281.417 ms/op
batch 1000 avgt 2 213.938 ms/op
batchRewriting 1000 avgt 2 148.009 ms/op
oneByOneInTransaction 1000 avgt 2 229.646 ms/op
batch 10000 avgt 2 2027.138 ms/op
batchRewriting 10000 avgt 2 1497.429 ms/op
oneByOneInTransaction 10000 avgt 2 2321.587 ms/op

DELETE

In case of DELETE the option rewriteBatchedStatements shouldn't affect anything. However, it's still better than regular batching.

As expected, by using IN clause we get the best performance, as it's a single query (just like in case of INSERT with multi-values).

Benchmark  (numberOfEntities)  Mode  Cnt     Score   Error  Units
batch 1 avgt 2 21.636 ms/op
batchRewriting 1 avgt 2 15.237 ms/op
inClause 1 avgt 2 13.483 ms/op
oneByOneInTransaction 1 avgt 2 10.938 ms/op
oneByOne 1 avgt 2 12.273 ms/op
batch 10 avgt 2 16.328 ms/op
batchRewriting 10 avgt 2 14.396 ms/op
inClause 10 avgt 2 11.184 ms/op
oneByOneInTransaction 10 avgt 2 13.085 ms/op
oneByOne 10 avgt 2 124.575 ms/op
batch 100 avgt 2 21.893 ms/op
batchRewriting 100 avgt 2 17.696 ms/op
inClause 100 avgt 2 13.029 ms/op
oneByOneInTransaction 100 avgt 2 24.492 ms/op
oneByOne 100 avgt 2 1181.656 ms/op
batch 1000 avgt 2 104.244 ms/op
batchRewriting 1000 avgt 2 83.070 ms/op
inClause 1000 avgt 2 25.444 ms/op
oneByOneInTransaction 1000 avgt 2 130.383 ms/op
batch 10000 avgt 2 925.338 ms/op
batchRewriting 10000 avgt 2 854.990 ms/op
inClause 10000 avgt 2 167.237 ms/op
oneByOneInTransaction 10000 avgt 2 1254.676 ms/op

Conclusion

The way to achieve the best performance with a database is to use the least amount of queries. In case of INSERT it's a multi-value query, in case of DELETE it's an IN clause with multiple identifiers specified. For DELETE queries it's on us to write it properly, and for INSERT queries there is a very nice driver option that converts your batch query into multi-value query and boosts the performance auto-magically!

Software developer, moved to Israel from Russia, trying to be aware of things.